Failure modes and failure mitigation in GPGPUs: a reference model and its application / Terrosi F.; Ceccarelli A.; Bondavalli A. - ELECTRONIC. - (2022), pp. 62-72. (Paper presented at the 46th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2022, held in the USA in 2022) [10.1109/COMPSAC54236.2022.00018].
Failure modes and failure mitigation in GPGPUs: a reference model and its application
Terrosi F.;Ceccarelli A.;Bondavalli A.
2022
Abstract
General Purpose GPUs (GPGPUs) are highly susceptible to both transient and permanent faults. This is a serious concern for their safe and reliable use in many domains, from autonomous driving to High Performance Computing. The research and industrial communities have responded vigorously to this issue by analyzing the impact of failures and devising failure mitigation strategies. This has led to the definition of several failure modes and mitigation approaches. Unfortunately, these are often based on different foundations, and it is not easy to position them in a consistent view. This work elaborates a GPGPU failure model, identifying the relations between GPGPU failure modes and components, and then analyzes the mitigations proposed in the literature. By proposing a unified view of failures and mitigations, the resulting model i) positions each research work on the subject, ii) makes current gaps easy to identify, and iii) sets the basis for further research on GPGPU failures.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
Failure_modes_and_failure_mitigation_in_GPGPUs_a_reference_model_and_its_application.pdf | Closed access | Publisher's PDF (Version of record) | All rights reserved | 498.77 kB | Adobe PDF