The Mediterranean region is experiencing rising temperatures and decreasing water availability, threatening durum wheat (Triticum durum Desf.), a crop essential for food security. To reduce climate-related yield and quality losses, it is critical to understand how these traits respond to environmental variability. This study combines phenotypic, climatic, and genomic data using machine learning to model durum wheat yield and grain protein content, aiming to identify high-performing varieties under specific climates and to uncover genetic factors for climate-resilient breeding strategies. Phenotypic data from over 200 T. durum varieties cultivated at 119 locations across Italy over a 24-year period (~30,000 observations) were analyzed. These were paired with monthly ERA5 climate variables covering the period from tillering to harvest. The environmental data included total and average rainfall, growing degree days (base 5 °C), solar radiation, relative air humidity, soil moisture at multiple depths, wind speed, number of consecutive heatwave days, and temperature metrics (mean, maximum, minimum), allowing for detailed characterization of growing conditions. Genomic data were obtained from the 90K iSelect SNP array, incorporating both varietal SNPs (distinguishing between cultivars) and interhomoeologous SNPs (capturing variation between the homoeologous genomes of tetraploid wheat). Four machine learning algorithms, Random Forest (RF), Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Bayesian Additive Regression Trees (BART), were trained to predict yield and protein content. Model performance was evaluated using RMSE, MAE, and R², with cross-validation and hyperparameter tuning to ensure robustness. While all models showed similar predictive power, RF was selected for its balanced accuracy, interpretability, and stable residuals, which are key features for the subsequent genomic analysis. Using climate-based RF models, we predicted yield and protein content.
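The three evaluation metrics named above follow standard definitions and can be sketched in a few lines. This is only an illustrative sketch: the function name `regression_metrics` and the example yield values are hypothetical and are not taken from the study.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE and R² for paired observed/predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    # Residuals are defined as observed minus predicted.
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observed values are identical")
    return {
        "rmse": math.sqrt(ss_res / n),                # root mean squared error
        "mae": sum(abs(r) for r in residuals) / n,    # mean absolute error
        "r2": 1.0 - ss_res / ss_tot,                  # coefficient of determination
    }


# Hypothetical yields in t/ha, for illustration only.
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.0, 5.5, 7.0]
metrics = regression_metrics(observed, predicted)
```

In a cross-validation setting, these metrics would be computed on each held-out fold and then summarized across folds.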
The residuals (observed minus predicted values), averaged by variety, represent phenotypic variation not explained by the environmental variables and were used as climate-adjusted traits for genome-wide association studies (GWAS). GWAS was performed using both traditional statistical methods and a Random Forest-based approach, with the latter estimating SNP significance via Altmann permutation, a non-parametric method suited to complex traits. This dual-GWAS framework enabled a comparison of classical and machine learning-based results, offering complementary insights into the genetic architecture of yield stability and grain protein concentration. In conclusion, integrating machine learning with genomic data enabled accurate modeling of durum wheat performance and the identification of genetic markers linked to climate-independent traits. This framework can support the development of varieties suited to specific agro-climatic zones and potentially more resilient to future environmental challenges, contributing to ongoing efforts to apply data-driven methods in agricultural research.
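The two steps described above, averaging residuals by variety and testing a marker by permuting the trait, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: all function names and data are hypothetical, the importance score is a simple absolute Pearson correlation standing in for Random Forest variable importance, and the add-one correction in the p-value is a common convention chosen here.

```python
import math
import random
from collections import defaultdict


def mean_residual_by_variety(records):
    """Average (observed - predicted) per variety.

    records: iterable of (variety, observed, predicted) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for variety, obs, pred in records:
        sums[variety] += obs - pred
        counts[variety] += 1
    return {v: sums[v] / counts[v] for v in sums}


def abs_correlation(x, y):
    """Stand-in importance score: |Pearson r|. NOT Random Forest importance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_p_value(genotypes, trait, importance, n_perm=999, seed=0):
    """Permutation p-value in the style of Altmann et al.

    The trait vector is shuffled to build a null distribution of the
    importance score; the p-value is the share of null scores that reach
    the observed one (with an add-one correction, an assumption here).
    """
    rng = random.Random(seed)
    observed = importance(genotypes, trait)
    shuffled = list(trait)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point ties count as exceedances.
        if importance(genotypes, shuffled) >= observed - 1e-12:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Hypothetical records: (variety, observed yield, predicted yield).
records = [("variety_A", 5.0, 4.0), ("variety_A", 6.0, 6.0), ("variety_B", 4.0, 5.0)]
adjusted = mean_residual_by_variety(records)

# Hypothetical SNP calls (0/1) and climate-adjusted traits for eight varieties.
snp = [0, 0, 0, 0, 1, 1, 1, 1]
trait = [-0.4, -0.3, -0.5, -0.2, 0.3, 0.4, 0.2, 0.5]
p_value = permutation_p_value(snp, trait, abs_correlation)
```

In a Random Forest-based analysis, `abs_correlation` would be replaced by the forest's variable importance for each SNP, with the model refitted on every permuted trait vector.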

Uncovering genetic drivers of yield and protein content in durum wheat (Triticum durum Desf.) using machine learning and GWAS approaches / VIERI W.*, F. GRINBERG N.***, I. ORHOBOR O.***, BELOCCHI A.****, BUTI M.**, PAFFETTI D.**. - Electronic. - (2025), pp. 0-0. (Paper presented at the conference LEVERAGING GENETIC INNOVATION FOR FUTURE-PROOFING CROPS, held in Viterbo, 09-12 September 2025).

Uncovering genetic drivers of yield and protein content in durum wheat (Triticum durum Desf.) using machine learning and GWAS approaches

VIERI W.; BUTI M.; PAFFETTI D.
2025

2025
Proceedings of the LXVIII SIGA Annual Congress
Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1436455