We extend the knockoffs method for selecting predictors to clustered data (cross-sectional or repeated measures). In this setting, variable selection is complex because some predictors are measured at the observation level (level 1), whereas others are measured at the cluster level (level 2), so their values are constant within clusters. Moreover, level 1 predictors are correlated within clusters. The solution we propose is to conduct variable selection separately at the two levels. To this end, we suggest a two-step approach: (i) decompose each level 1 predictor into level 2 and level 1 components by replacing it with the cluster mean and the deviation from the cluster mean; (ii) perform variable selection separately at the two levels, where the level 1 data matrix includes the deviations from the cluster means and the level 2 data matrix includes the level 2 predictors and the cluster means of level 1 predictors. To evaluate the performance of the proposed approach, we conduct a simulation study comparing sequential knockoffs, derandomised knockoffs, and the Lasso in terms of false discovery rate and power. All methods fail when applied to the complete data matrix, which includes both level 1 and level 2 predictors. In contrast, all methods perform better when applied to the level 1 and level 2 data matrices separately, where the study shows satisfactory results for false discovery rate and power. Moreover, the sequential knockoffs method performs substantially better than the Lasso and the derandomised knockoffs. Our proposal to implement the knockoffs method in a clustered data framework is feasible, flexible, and effective.
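
The decomposition in step (i) and the two data matrices in step (ii) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`decompose_level1`, `build_level_matrices`), the `_mean` and `_dev` column suffixes, and the toy data are all assumptions made for illustration, and the selection step itself (knockoffs or the Lasso) is deliberately left out.

```python
from collections import defaultdict


def decompose_level1(cluster_ids, values):
    """Split one level 1 predictor into cluster means and deviations.

    Hypothetical helper: returns (means, deviations), where `means` maps
    each cluster id to its mean and `deviations` has one entry per
    observation (value minus the mean of its cluster).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(cluster_ids, values):
        sums[c] += v
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    deviations = [v - means[c] for c, v in zip(cluster_ids, values)]
    return means, deviations


def build_level_matrices(cluster_ids, level1, level2):
    """Build the two matrices on which selection would be run separately.

    level1: dict name -> list of per-observation values
    level2: dict name -> dict mapping cluster id -> value
    Returns (level1_matrix, level2_matrix):
      level1_matrix: dict name_dev -> per-observation deviations
      level2_matrix: dict cluster id -> {level 2 predictors, cluster means}
    """
    level1_matrix = {}
    level2_matrix = {c: {} for c in dict.fromkeys(cluster_ids)}
    # Level 2 predictors enter the level 2 matrix unchanged.
    for name, by_cluster in level2.items():
        for c in level2_matrix:
            level2_matrix[c][name] = by_cluster[c]
    # Each level 1 predictor contributes a mean (level 2) and a deviation (level 1).
    for name, values in level1.items():
        means, deviations = decompose_level1(cluster_ids, values)
        level1_matrix[name + "_dev"] = deviations
        for c in level2_matrix:
            level2_matrix[c][name + "_mean"] = means[c]
    return level1_matrix, level2_matrix


# Toy data (invented): five observations in two clusters.
cluster_ids = ["a", "a", "b", "b", "b"]
level1 = {"x": [1.0, 3.0, 2.0, 4.0, 6.0]}   # observation-level predictor
level2 = {"z": {"a": 10.0, "b": 20.0}}      # cluster-level predictor

level1_matrix, level2_matrix = build_level_matrices(cluster_ids, level1, level2)
print(level1_matrix)
print(level2_matrix)
```

By construction the deviations sum to zero within each cluster, so the level 1 matrix carries only within-cluster variation, while the level 2 matrix has one row per cluster. A selection procedure would then be applied to each matrix in turn.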

Variable selection via knockoffs for clustered data / Silvia Bacci, Leonardo Grilli, Ersilia Lucenteforte, Carla Rampichini. - In: ADVANCES IN DATA ANALYSIS AND CLASSIFICATION. - ISSN 1862-5355. - Electronic. - (2026), pp. 1-19. [10.1007/s11634-026-00682-9]

Variable selection via knockoffs for clustered data

Silvia Bacci;Leonardo Grilli;Ersilia Lucenteforte;Carla Rampichini
2026

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1473952