Online Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters / M. Serafini; A. Bondavalli; N. Suri. - In: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING. - ISSN 1545-5971. - PRINT. - 4:(2007), pp. 295-312. [10.1109/TDSC.2007.70210]

Online Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters

BONDAVALLI, ANDREA;
2007

Abstract

Fault Detection, followed by Isolation of the erroneous node and system Reconfiguration (node exclusion or recovery), forms the FDIR process that characterizes the sustained operation of a fault-tolerant system. For distributed systems that use message passing, a number of diagnostic (and associated FDIR) approaches, including our prior algorithms, exist in the literature and in practice. Invariably, the focus is on proving completeness and correctness (all and only the faulty nodes are isolated) for the chosen fault model, without explicitly distinguishing permanently faulty nodes from transiently faulty ones. To capture diagnostic issues related to the persistence of errors (transient, intermittent, and permanent), we advocate integrating count-and-threshold mechanisms into the FDIR framework. Targeting pragmatic system issues, we develop an adaptive online FDIR framework that handles a continuum of fault models and diagnostic protocols. We comprehensively characterize the role of the probabilistic parameters, such as the fault detection frequency, that, because of the count-and-threshold approach, influence the correctness and completeness of diagnosis and the reliability of the system. The FDIR framework has been implemented on two prototypes for automotive and aerospace applications. Tuning the protocol parameters at design time yields a significant improvement over prior design choices.
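To make the count-and-threshold idea concrete, the following is a minimal sketch of a generic per-node penalty counter, not the protocol from the paper. It assumes one common variant: the score grows by one for each round in which an error is detected, decays multiplicatively after an error-free round, and the node is isolated once the score reaches a threshold. The class name, the `decay` and `threshold` parameters, and their values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CountAndThreshold:
    """Generic per-node penalty counter (hypothetical names and values)."""
    decay: float = 0.5      # assumed decay factor applied after an error-free round
    threshold: float = 3.0  # assumed score at which the node is isolated
    score: float = 0.0

    def observe(self, error_detected: bool) -> bool:
        """Update the score for one diagnostic round.

        Returns True when the score has reached the isolation threshold.
        """
        if error_detected:
            self.score += 1.0          # count the error
        else:
            self.score *= self.decay   # forget old errors gradually
        return self.score >= self.threshold


def rounds_until_isolation(observations, **params):
    """Return the 1-based round at which the node is isolated, or None."""
    counter = CountAndThreshold(**params)
    for round_number, error_detected in enumerate(observations, start=1):
        if counter.observe(error_detected):
            return round_number
    return None


if __name__ == "__main__":
    permanent = [True] * 5
    transient = [True, False, False, True, False, False]
    intermittent = [True, True, False, True, True]
    print(rounds_until_isolation(permanent))
    print(rounds_until_isolation(transient))
    print(rounds_until_isolation(intermittent))
```

With these assumed values, a node that fails in every round is isolated quickly, isolated transient errors decay away before reaching the threshold, and a recurring intermittent pattern is eventually isolated. The choice of decay and threshold trades isolation latency against the risk of excluding a healthy node after a transient, which is the kind of tuning the abstract refers to.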
Year: 2007
Volume: 4
Pages: 295-312
Authors: M. Serafini; A. Bondavalli; N. Suri
Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/316327