Robust machine learning challenge: An AIFM multicentric competition to spread knowledge, identify common pitfalls and recommend best practice / Maddalo, Michele; Fanizzi, Annarita; Lambri, Nicola; Loi, Emiliano; Branchini, Marco; Lorenzon, Leda; Giuliano, Alessia; Ubaldi, Leonardo; Saponaro, Sara; Signoriello, Michele; Fadda, Federico; Belmonte, Gina; Giannelli, Marco; Talamonti, Cinzia; Iori, Mauro; Tangaro, Sabina; Massafra, Raffaella; Mancosu, Pietro; Avanzo, Michele. - In: PHYSICA MEDICA. - ISSN 1120-1797. - ELECTRONIC. - 127 (2024), art. 104834. [10.1016/j.ejmp.2024.104834]
Robust machine learning challenge: An AIFM multicentric competition to spread knowledge, identify common pitfalls and recommend best practice
Ubaldi, Leonardo; Talamonti, Cinzia
2024
Abstract
Purpose: A novel and unconventional approach to a machine learning challenge was designed to spread knowledge, identify robust methods, and highlight potential pitfalls of machine learning within the Medical Physics community.

Methods: A public dataset comprising 41 radiomic features and 535 patients was employed to assess the potential of radiomics in distinguishing between primary lung tumors and metastases. Each participant developed two classification models: (i) one using all features (base model) and (ii) one using only robust features (robust model). Both models were validated with cross-validation and on unseen data. The population stability index (PSI) was used as a diagnostic metric for implementation issues. Performance was compared to a reference model. Base and robust models were compared in terms of performance and stability (coefficient of variation, CoV, of prediction probabilities).

Results: PSI detected potential implementation errors in 70% of models. The dataset exhibited strong class imbalance. The average Gmean (a metric appropriate for imbalanced data) among all participants was 0.67 ± 0.01, significantly higher than the reference Gmean of 0.50 ± 0.04. Robust models performed slightly worse than base models (p < 0.05). Regarding stability, robust models exhibited a lower median CoV on the training set only.

Conclusion: AI4MP-Challenge models outperformed the reference, significantly improving the Gmean. Excluding less-robust features did not improve model robustness and should be avoided when confounding effects are absent. Other methods, such as harmonization or data augmentation, should be evaluated instead. This study demonstrated how a collaborative effort to foster machine learning knowledge among medical physicists, through interactive sessions and the exchange of information among participants, can result in improved models.
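The two diagnostic quantities named in the abstract, the population stability index (PSI) and the geometric mean of sensitivity and specificity (Gmean), can each be computed in a few lines. The following is a minimal Python/NumPy sketch, not the challenge's actual implementation; the bin count, the clipping epsilon, and the simulated scores are illustrative assumptions.

import numpy as np

def psi(expected_scores, actual_scores, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (a - e) * ln(a / e), where e and a are the
    fractions of expected (e.g. training) and actual (e.g. unseen-data)
    scores falling in each bin."""
    edges = np.quantile(expected_scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover the full score range
    e = np.histogram(expected_scores, bins=edges)[0] / len(expected_scores)
    a = np.histogram(actual_scores, bins=edges)[0] / len(actual_scores)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

def gmean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)  # true-positive rate
    specificity = np.mean(y_pred[y_true == 0] == 0)  # true-negative rate
    return float(np.sqrt(sensitivity * specificity))

# Example: compare training vs. unseen-data prediction probabilities.
rng = np.random.default_rng(0)
train_probs = rng.beta(2, 5, 400)                # simulated scores
test_probs = rng.beta(2, 5, 135)
print(f"PSI:   {psi(train_probs, test_probs):.3f}")   # near 0 => stable
print(f"Gmean: {gmean([0, 0, 1, 1, 0, 1], [0, 1, 1, 1, 0, 0]):.3f}")

A common rule of thumb from credit scoring, where PSI originated, is that values below 0.1 indicate a stable score distribution while values above 0.25 signal a substantial shift worth investigating; the specific criterion the challenge used to flag implementation errors is not stated in the abstract.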