Dealing with Small Datasets in Artificial Intelligence: focus on Medical Imaging / Stefano Piffer. - (2023).
Dealing with Small Datasets in Artificial Intelligence: focus on Medical Imaging
Stefano Piffer
2023
Abstract
Artificial Intelligence (AI) will likely affect healthcare systems significantly, and it could play a key role in clinical decision making in the future. Deep learning and radiomics are extremely promising machine learning approaches for analyzing complex and high-dimensional medical images. Unfortunately, machine learning models that work with imaging data require massive amounts of data, so their implementation in healthcare settings remains limited, mostly due to the lack of very large datasets on which to test the generalizability and reliability of trained models. Although many institutes are collaborating to produce publicly available datasets of medical images, access to medical images remains limited, and small sample sizes and scarce geographic diversity hinder the generalizability and accuracy of the developed solutions. Moreover, the process of data acquisition faces several challenges, mainly related to privacy regulations and to the effort required from domain experts to assess imaging data quality and produce a high-quality ground truth. Medical data are often stored in disparate silos, which makes managing large medical imaging datasets difficult. Furthermore, merely gaining access to large quantities of image data is insufficient to overcome these shortcomings: adequate curation, analysis, labeling, and clinical application are critical to achieving high-impact, clinically meaningful AI algorithms.

This Ph.D. thesis describes the process of labeling, curating, managing, and sharing medical image data for AI algorithm development with optimal clinical impact, while maintaining a high degree of privacy and security when exchanging sensitive data. The pros and cons of heterogeneous versus homogeneous data are taken into consideration: the former, arising from the diversity of the populations included in a dataset, leads to incompleteness because of differing data acquisition standards and practices; the latter, although it yields complete and uniform datasets, does not fully capture the natural variability of the population.

This work applies several approaches proposed in the literature to alleviate the problem of small data samples in AI. Well-established techniques such as unsupervised hierarchical clustering and transfer learning were analyzed in the context of rare disease stratification. Moreover, a U-Net was trained from scratch with the help of data augmentation, merging public datasets while trying to contain data and label heterogeneity. The results are promising, showing that transfer learning can enable the training of custom models on small datasets by exploiting the powerful feature extraction modules of Convolutional Neural Networks. Different methods of selecting and combining features make it possible to incorporate more information and to reach a higher level of abstraction, which in our case led to a natural clustering of the data. Moreover, data augmentation combining different public datasets also proved an effective technique for carrying out a complete training. In the clinical context, building effective models from small data is an urgent task, since machine learning systems can identify correlations between medical images and clinical endpoints that would otherwise be extremely difficult to find. This path is viable and the right tools exist, but one needs to know how to use them knowledgeably, adapting them to the needs of each case.
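As a concrete illustration of the transfer-learning-plus-clustering approach described in the abstract, the snippet below uses a pretrained CNN as a frozen feature extractor and feeds the resulting vectors to unsupervised hierarchical clustering. It is a minimal sketch, not the thesis's actual code: the ResNet-18 backbone, input size, Ward linkage, cluster count, and file names are all illustrative assumptions.

```python
# Minimal sketch: transfer learning as feature extraction + hierarchical clustering.
# Assumes PyTorch, torchvision, Pillow and SciPy; all concrete choices
# (backbone, image size, linkage, cluster count) are illustrative.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from scipy.cluster.hierarchy import linkage, fcluster

# Pretrained backbone reused as a frozen feature extractor: the final
# classification layer is replaced by an identity, keeping the pooled features.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths):
    """Return one 512-dimensional deep feature vector per image."""
    feats = []
    for path in image_paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(backbone(img).squeeze(0))
    return torch.stack(feats).numpy()

# Unsupervised hierarchical clustering on the extracted features.
features = extract_features(["scan_001.png", "scan_002.png", "scan_003.png"])  # hypothetical files
tree = linkage(features, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # cluster assignment per image
```

Because the backbone stays frozen, this approach requires no gradient updates of its own and therefore remains usable when only a handful of images is available.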
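The data-augmentation strategy mentioned for the U-Net training can likewise be sketched. The example below is an assumption-laden illustration rather than the thesis pipeline: it applies the same geometric transforms to image and mask so that segmentation labels stay consistent, using the albumentations library; the specific transform list and target size are placeholders.

```python
# Minimal sketch: paired image/mask augmentation for U-Net segmentation training.
# Assumes the albumentations library; transforms and sizes are illustrative.
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.Resize(256, 256),                  # bring images from merged public datasets to one size
    A.HorizontalFlip(p=0.5),             # geometric: applied to image and mask alike
    A.Rotate(limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.3),   # intensity: applied to the image only
    A.Normalize(mean=0.0, std=1.0),
    ToTensorV2(),
])

# Usage on one (image, mask) pair loaded as numpy arrays:
# out = train_transform(image=image, mask=mask)
# x, y = out["image"], out["mask"]
```

Merging public datasets this way enlarges the effective training set, while applying one shared augmentation pipeline across sources helps contain the data and label heterogeneity discussed above.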
This work has been developed in the framework of the INFN-funded AIM project, which aims to exploit the expertise of INFN and associated researchers in medical data processing and enhancement, and to turn it into advanced and effective analysis instruments to be eventually validated clinically.
File | Type | License | Size | Format
---|---|---|---|---
thesis_Piffer.pdf (Open Access from 01/01/2025) | Doctoral thesis | Creative Commons | 16.98 MB | Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.