Inferring an individual’s BioGeographical Ancestry (BGA) through DNA analysis is a valuable tool in various fields such as forensic science, especially when traditional methods fail to identify suspects or victims. Advances in Next-Generation Sequencing (NGS) have revolutionized genomic data acquisition, enabling the development of comprehensive Single Nucleotide Polymorphism (SNP) panels for ancestry inference. This study assessed the effectiveness of a novel panel containing 3,234 SNPs at both inter-continental and a more detailed BGA level, using various supervised Machine Learning (ML) models, including Categorical Naive Bayes, Penalized Multinomial Logistic Regression, Linear Support Vector Machines, Random Forest, and tree-based Gradient Boosting. A nested cross-validation approach was employed for model tuning and evaluation, with balanced accuracy as the main performance metric to address class imbalance. At the inter-continental level, all ML models demonstrated high balanced accuracy, confirming their reliability for BGA inference. However, performance declined at the more detailed continental level, likely due to a combination of factors including increased class imbalance, reduced sample sizes for certain populations, and the inherent complexity of distinguishing genetically and geographically proximate groups. Nonetheless, promising results were observed for South Asians, Northeast Asians, Europeans, and West Africans classes. In contrast, performance was notably lower for underrepresented classes such as Inner Asians. Misclassification patterns at both levels appeared to reflect known geographical and historical relationships, although further analysis revealed that these were often concentrated in underrepresented or genetically complex groups. These findings highlighted the potential of this SNP panel and ML approaches as valuable tools for forensic investigations.
Predictive modeling of biogeographical ancestry using a novel SNP panel and supervised learning approaches / Grazzini, Cosimo; Spera, Giorgia; Morelli, Stefania; Castellana, Daniele; Cosenza, Giulia; Baccini, Michela; Cereda, Giulia; Pilli, Elena. - In: EXPERT SYSTEMS WITH APPLICATIONS. - ISSN 0957-4174. - ELETTRONICO. - (2025), pp. 0-0. [10.1016/j.eswa.2025.129662]
Predictive modeling of biogeographical ancestry using a novel SNP panel and supervised learning approaches
Grazzini, Cosimo;Spera, Giorgia;Morelli, Stefania;Castellana, Daniele;Cosenza, Giulia;Baccini, Michela;Cereda, Giulia
;Pilli, Elena
2025
Abstract
Inferring an individual’s BioGeographical Ancestry (BGA) through DNA analysis is a valuable tool in various fields such as forensic science, especially when traditional methods fail to identify suspects or victims. Advances in Next-Generation Sequencing (NGS) have revolutionized genomic data acquisition, enabling the development of comprehensive Single Nucleotide Polymorphism (SNP) panels for ancestry inference. This study assessed the effectiveness of a novel panel containing 3,234 SNPs at both inter-continental and a more detailed BGA level, using various supervised Machine Learning (ML) models, including Categorical Naive Bayes, Penalized Multinomial Logistic Regression, Linear Support Vector Machines, Random Forest, and tree-based Gradient Boosting. A nested cross-validation approach was employed for model tuning and evaluation, with balanced accuracy as the main performance metric to address class imbalance. At the inter-continental level, all ML models demonstrated high balanced accuracy, confirming their reliability for BGA inference. However, performance declined at the more detailed continental level, likely due to a combination of factors including increased class imbalance, reduced sample sizes for certain populations, and the inherent complexity of distinguishing genetically and geographically proximate groups. Nonetheless, promising results were observed for South Asians, Northeast Asians, Europeans, and West Africans classes. In contrast, performance was notably lower for underrepresented classes such as Inner Asians. Misclassification patterns at both levels appeared to reflect known geographical and historical relationships, although further analysis revealed that these were often concentrated in underrepresented or genetically complex groups. These findings highlighted the potential of this SNP panel and ML approaches as valuable tools for forensic investigations.I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.



