Tree-based models are a popular tool for predicting a response given a set of explanatory variables when the regression function is characterized by a certain degree of complexity. Sometimes, they are also used to identify important variables and for variable selection. We show that if the generating model contains chains of direct and indirect effects, then the typical variable importance measures suggest selecting as important mainly the background variables, which have a strong indirect effect, disregarding the variables that directly influence the response. This is attributable mainly to the variable choice in the first steps of the algorithm selecting the splitting variable and to the greedy nature of such search. This pitfall could be relevant when using tree-based algorithms for understanding the underlying generating process, for population segmentation and for causal inference.
A note on the interpretation of tree-based regression models / Gottard, Anna; Vannucci, Giulia; Marchetti, Giovanni Maria. - In: BIOMETRICAL JOURNAL. - ISSN 0323-3847. - STAMPA. - 62:(2020), pp. 6.1564-6.1573. [10.1002/bimj.201900195]
A note on the interpretation of tree-based regression models
Gottard, Anna
;Vannucci, Giulia;Marchetti, Giovanni Maria
2020
Abstract
Tree-based models are a popular tool for predicting a response given a set of explanatory variables when the regression function is characterized by a certain degree of complexity. Sometimes, they are also used to identify important variables and for variable selection. We show that if the generating model contains chains of direct and indirect effects, then the typical variable importance measures suggest selecting as important mainly the background variables, which have a strong indirect effect, disregarding the variables that directly influence the response. This is attributable mainly to the variable choice in the first steps of the algorithm selecting the splitting variable and to the greedy nature of such search. This pitfall could be relevant when using tree-based algorithms for understanding the underlying generating process, for population segmentation and for causal inference.File | Dimensione | Formato | |
---|---|---|---|
bimj.201900195.pdf
Accesso chiuso
Tipologia:
Pdf editoriale (Version of record)
Licenza:
Tutti i diritti riservati
Dimensione
882.4 kB
Formato
Adobe PDF
|
882.4 kB | Adobe PDF | Richiedi una copia |
I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.