General Purpose GPUs (GPGPUs) are highly susceptible to both transient and permanent faults. This is a serious concern for their safe and reliable use in many domains, from autonomous driving to High Performance Computing. The research and industrial communities have responded vigorously to this issue by analyzing the impact of failures and devising failure mitigation strategies. This has led to the definition of several failure modes and mitigation approaches. Unfortunately, these often rest on different foundations, and it is not easy to position them in a consistent view. This work develops a GPGPU failure model, identifying the relations between GPGPU failure modes and components, and then analyzes the mitigations proposed in the literature. By proposing a unified view of failures and mitigations, the resulting model i) positions each research work on the subject, ii) makes the current gaps easy to identify, and iii) sets the basis for further research on GPGPU failures.

Failure modes and failure mitigation in GPGPUs: a reference model and its application / Terrosi F.; Ceccarelli A.; Bondavalli A. - Electronic. - (2022), pp. 62-72. (Paper presented at the 46th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2022, held in the USA in 2022) [10.1109/COMPSAC54236.2022.00018].

Failure modes and failure mitigation in GPGPUs: a reference model and its application

Terrosi F.; Ceccarelli A.; Bondavalli A.
2022

Published in: Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022
Conference: 46th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2022, USA, 2022
Files in this record:
File: Failure_modes_and_failure_mitigation_in_GPGPUs_a_reference_model_and_its_application.pdf
Access: Closed
Type: Publisher's PDF (Version of record)
License: DRM not defined
Size: 498.77 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/2158/1281963