Prediagnostic circulating metabolites in female breast cancer cases with low and high mammographic breast density

Mammographic breast density (MBD) is a strong independent risk factor for breast cancer (BC). We designed a matched case–case study in the EPIC Florence cohort to evaluate possible associations between the pre-diagnostic metabolomic profile and the risk of BC in high- versus low-MBD women who developed BC during the follow-up. A case–case design with 100 low-MBD (MBD ≤ 25%) and 100 high-MBD BC cases (MBD > 50%) was used. Matching variables included age, year and type of mammographic examination. 1H NMR metabolomic spectra were available for 87 complete case–case sets. The conditional logistic analyses showed an inverse association between serum levels of alanine, leucine, tyrosine, valine, lactic acid, pyruvic acid, triglycerides lipid main fraction and 11 VLDL lipid subfractions and high-MBD cases. Acetic acid was directly associated with high-MBD cases. In models adjusted for confounding variables, tyrosine remained inversely associated with high-MBD cases while 3 VLDL subfractions of free cholesterol emerged as directly associated with high-MBD cases. A pathway analysis showed that the “phenylalanine, tyrosine and tryptophan pathway” emerged and persisted after applying the FDR procedure. The supervised OPLS-DA analysis revealed a slight but significant separation between high- and low-MBD cases. This case–case study suggested a possible role for pre-diagnostic levels of tyrosine in modulating the risk of BC in high- versus low-MBD women. Moreover, some differences emerged in the pre-diagnostic concentration of other metabolites as well as in the metabolomic fingerprints between the two groups of patients.


Methods
Study cohort. The European Prospective Investigation into Cancer and nutrition (EPIC) Florence cohort has been set up as a part of the EPIC European prospective study and enrolled (between 1993 and 1998) 10,083 clinically healthy women aged 35-64 years residing in the Florence area (Tuscany, Central Italy). All study participants signed an informed consent and gave permission to use the data collected during the study. The study was approved by the local Ethics Committee "Azienda Sanitaria Firenze". All procedures performed were in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
At enrolment, weight, height, waist and hip circumferences were measured by trained nurses according to an international standard protocol. Data on frequency of consumption of 188 foods and drinks and usual portion size were obtained through a validated self-administered Food Frequency Questionnaire specifically developed to capture the Italian dietary habits. A standardized lifestyle questionnaire collected detailed information on reproductive history, smoking and alcohol drinking history, educational level, physical activity habits and medical history. Information on drug use including hormone replacement therapy (HRT) was also collected. Following a standardized protocol, a fasting blood sample was collected for every participant, processed, aliquoted and stored in the liquid nitrogen biological bank of the study, for long-term storage 32 .
The ascertainment of vital status was carried out through the linkage with the local town offices and the local Mortality Registry, thereby identifying the deceased subjects and the date and cause of death. Standardized follow-up procedures have been periodically implemented for the identification of cancer cases diagnosed after enrolment. The identification of BC cases (code C50 according to ICD-O-2 classification) was obtained through periodical linkage with the hospital discharge system and the Pathology Department registries 32 . At the 31/12/2015 follow up, 573 BC cases have been identified in the EPIC Florence cohort. Information on oestrogen receptor (ER) and progesterone receptor (PR) status was provided on the basis of pathology reports. Two categories (negative/positive) were considered according to well-established cut-off values (10% for both ER and PR) 33,34 .
In order to update the mammographic examination (ME) history of the EPIC female participants, we periodically performed a linkage with the mammographic archives of the population-based local mammographic screening (run by ISPRO, Florence) and of the MEs performed in a clinical setting at our Institution 35 . For each newly identified BC case we retrieved a negative ME performed at least one year before the BC diagnosis, if available, or otherwise the diagnostic ME. All MEs were reviewed by the study radiologist (DA) and classified according to the 4th Breast Imaging Reporting and Data System (BI-RADS) criteria: D1 < 25%, D2 = 25-50%, D3 = 51-75%, D4 > 75% of the area of the breast showing fibroglandular density. Overall, 481 out of the 573 identified BC cases have been classified according to the 4 BI-RADS categories 36 .
Design of the nested case-case study. A case-case design was used to compare the pre-diagnostic metabolite concentrations between low-MBD women who developed BC (low-MBD cases) and high-MBD women who also developed BC (high-MBD cases).
A 1:1 case-case study was set up by selecting 100 high-MBD cases (MBD > 50%, BI-RADS = D3 or D4) and 100 low-MBD cases (MBD < 25%, BI-RADS = D1) matched by age at cohort entry (± 5 years), characteristics of the ME used to classify the MBD (analogical/digital; negative/diagnostic) and year of ME (before/after 31st December, 1999). The population included in the present study consisted of all pairs that could be obtained through the above described procedure.
A total of 194 serum samples (97 complete case-case sets) were retrieved from the liquid nitrogen biological bank of the study and shipped to the study laboratory for the metabolomic profile examination. Metabolomic spectra were available for 174 serum samples corresponding to 87 complete case-case sets. Ten case-case sets were excluded because serum samples were of insufficient quality for metabolomics analysis (i.e. haemolyzed). Serum [...]tion of all molecules present in concentrations above the detection limit) spectra were acquired. Samples were prepared and NMR spectra acquired following standard procedures 22 . Free induction decays were multiplied by an exponential function equivalent to a 0.3 Hz line-broadening factor before applying Fourier transform. Transformed spectra were automatically corrected for phase and baseline distortions and calibrated at the anomeric glucose signal at 5.24 ppm using TopSpin 3.2.

Statistical analysis.
Main baseline characteristics of BC cases were described separately for high- and low-MBD cases. Means, standard deviations and p-values from t tests or Wilcoxon rank-sum tests were computed for continuous variables. Frequencies and Pearson's chi-squared tests were computed for categorical variables. Quantification of metabolites, lipid main fractions and subfractions was performed using the Bruker IVDr platform 38 . Completeness of measures and limits of quantification (LOQ) are shown in Supplementary Table 1. Values lower than the LOQ were imputed with half the LOQ. Metabolites with more than 20% of observations under the LOQ (n = 6) were excluded from the statistical analyses 39 .
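The LOQ handling described above can be illustrated with a minimal Python sketch. It imputes values below the LOQ with half the LOQ and flags a metabolite for exclusion when more than 20% of its observations fall below the LOQ. The function name and the list-based data layout are assumptions made for this example; they are not part of the Bruker IVDr platform or of the original analysis.

```python
def impute_below_loq(values, loq, max_fraction_below=0.20):
    """Handle one metabolite's measurements against its LOQ.

    Returns (imputed_values, keep): values below `loq` are replaced by
    loq / 2, and `keep` is False when more than `max_fraction_below` of
    the observations are below the LOQ (metabolite to be excluded).
    """
    below = [v < loq for v in values]
    fraction_below = sum(below) / len(values)
    imputed = [loq / 2 if is_below else v for v, is_below in zip(values, below)]
    return imputed, fraction_below <= max_fraction_below


# Hypothetical example: one of five values is below an LOQ of 1.0.
imputed, keep = impute_below_loq([0.1, 2.0, 3.0, 4.0, 5.0], loq=1.0)
```

Exactly 20% of observations below the LOQ keeps the metabolite here, since the text excludes only those with more than 20%.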
Means of metabolite concentrations in high- and low-MBD cases were computed. Metabolites, lipid main fractions and lipoprotein subfraction concentration values were log-transformed in order to normalize the distribution. Conditional logistic regression models were fitted to estimate the association between metabolites, lipid main fractions and lipoprotein subfractions concentration and being a high-MBD case. Each single metabolite, lipid main fraction and lipoprotein subfraction (continuous, per standard deviation) was separately added to the model.
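For 1:1 matched sets, the conditional likelihood of a single-covariate model reduces to a logistic model on the within-pair differences with no intercept. The sketch below uses this property to estimate an odds ratio per standard deviation of a log-transformed metabolite with a few Newton-Raphson steps. It is an illustration only: the function name, the input layout and the choice of computing the SD on the pooled log-transformed values are assumptions, and it needs some pairs with differences in each direction to converge.

```python
import math
import statistics


def odds_ratio_per_sd(high_mbd, low_mbd, max_iter=50, tol=1e-10):
    """Conditional-logistic OR per SD of a log-transformed metabolite.

    high_mbd[i] and low_mbd[i] are the concentrations of the high-MBD and
    low-MBD case in matched pair i (hypothetical input layout).
    Returns (odds_ratio, (ci_low, ci_high)) with a Wald 95% interval.
    """
    log_high = [math.log(v) for v in high_mbd]
    log_low = [math.log(v) for v in low_mbd]
    # Assumption: the SD is computed on all log-transformed values pooled.
    sd = statistics.stdev(log_high + log_low)
    diffs = [(h - l) / sd for h, l in zip(log_high, log_low)]

    beta = 0.0
    for _ in range(max_iter):
        # Probability that the high-MBD member is the high-MBD case, per pair.
        probs = [1.0 / (1.0 + math.exp(-beta * d)) for d in diffs]
        gradient = sum(d * (1.0 - p) for d, p in zip(diffs, probs))
        information = sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
        step = gradient / information  # Newton-Raphson update
        beta += step
        if abs(step) < tol:
            break

    se = 1.0 / math.sqrt(information)
    ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
    return math.exp(beta), ci


# Hypothetical pairs in which the high-MBD case usually has the lower level.
or_per_sd, ci95 = odds_ratio_per_sd([1, 1, 1, 2], [2, 2, 2, 1])
```

An odds ratio below 1 from this sketch corresponds to an inverse association with being a high-MBD case.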
Additional models were fitted in order to adjust for a set of potential confounding variables mainly related to MBD modulation (age at diagnosis, baseline menopausal status, number of full-term pregnancies, ER status, breastfeeding and baseline body mass index class). Further models were also fitted adjusting for waist/hip ratio, diabetes, hypertension and hyperlipidaemia. p values were adjusted for multiple testing using the false discovery rate (FDR) procedure with Benjamini-Hochberg correction at α = 0.05 40 . STATA 14.1 software was used for these analyses.
MetaboAnalyst 41,42 was used to analyse the metabolic pathways related to the identified metabolites. The metabolic pathways analysis was conducted on the metabolites showing a significant association in conditional logistic models, with the exclusion of lipid fractions that were not directly matchable with MetaboAnalyst. According to previous studies, only pathways with an impact > 0.2 were considered 43 .
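The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines of Python. The function below is a generic implementation of the standard step-up procedure and returns adjusted p values in the original order; it is not the STATA routine used for the analyses, and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for four metabolites.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in adjusted]
```

An association is retained at α = 0.05 when its adjusted p value does not exceed 0.05.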
To perform the multivariate analysis on the NMR spectra, each 1D spectrum in the range 0.2-10.00 ppm (thus the whole spectra, considering both assigned and unassigned metabolites) was segmented into 0.02 ppm chemical shift bins and the corresponding spectral areas were integrated using AMIX software (version 3.8.4, Bruker BioSpin). The region between 5.12 and 4.40 ppm containing the residual water signal was removed and the dimension of the system was reduced to 455 bins. The total spectral area was calculated on the remaining bins and total integral normalization was carried out prior to pattern recognition.
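The exclusion of the residual water region and the total integral normalization can be sketched as follows. The authors used AMIX for binning; this Python function only illustrates the arithmetic on already integrated bins, and its name, the use of bin centres to decide which bins fall in the water region, and the inclusive boundaries are assumptions.

```python
def normalize_bins(bin_centers_ppm, bin_areas, water=(4.40, 5.12)):
    """Drop bins in the residual water region, then scale areas to sum to 1.

    Returns a list of (bin_center_ppm, normalized_area) pairs.
    """
    kept = [
        (center, area)
        for center, area in zip(bin_centers_ppm, bin_areas)
        if not (water[0] <= center <= water[1])
    ]
    total_area = sum(area for _, area in kept)
    return [(center, area / total_area) for center, area in kept]


# Hypothetical four-bin spectrum; the bins at 4.5 and 5.0 ppm are removed.
normalized = normalize_bins([1.0, 4.5, 5.0, 6.0], [2.0, 10.0, 10.0, 6.0])
```

After normalization the retained bins of each spectrum sum to one, so spectra with different overall intensity become comparable before pattern recognition.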
Unsupervised Principal Component Analysis (PCA) was used as a first exploratory analysis to visualize the data and to detect possible outliers. Differences in the serum metabolomic fingerprints were then assessed using a supervised Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) to cluster the groups of interest. In each OPLS-DA model the minimum number of latent variables that maximized model accuracy was retained (CPMG n = 7; NOESY n = 9; Diffusion n = 6). Accuracy, sensitivity and specificity for the OPLS-DA classifications were assessed by means of 100 cycles of a Monte Carlo cross-validation scheme (MCCV, in-house R script). Briefly, at each iteration 90% of the data were randomly chosen as a training set to build the model, the remaining 10% were used as a test set, and sensitivity, specificity and accuracy for the classification were assessed according to the standard definitions. Significance of the classification results was assessed by means of a permutation test using 10² permutations.
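The Monte Carlo cross-validation scheme can be sketched in Python as below. This is not the authors' in-house R script: the OPLS-DA model is replaced by a deliberately simple one-dimensional nearest-centroid classifier so that the example stays self-contained, and the function names, the pooling of test-set predictions across cycles and the individual (not pair-wise) splitting of samples are all assumptions.

```python
import random


def classification_metrics(true_labels, predicted_labels):
    """Accuracy, sensitivity and specificity, with label 1 as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def monte_carlo_cv(samples, labels, fit_predict, n_cycles=100,
                   test_fraction=0.10, seed=0):
    """Repeat random 90/10 splits and pool the test-set predictions."""
    rng = random.Random(seed)
    n_test = max(1, round(len(samples) * test_fraction))
    true_all, pred_all = [], []
    for _ in range(n_cycles):
        indices = list(range(len(samples)))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        true_all.extend(labels[i] for i in test_idx)
        pred_all.extend(predictions)
    return classification_metrics(true_all, pred_all)


def nearest_centroid_1d(train_x, train_y, test_x):
    """Stand-in classifier (not OPLS-DA): assign to the closer class mean."""
    ones = [x for x, y in zip(train_x, train_y) if y == 1]
    zeros = [x for x, y in zip(train_x, train_y) if y == 0]
    c1, c0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    return [1 if abs(x - c1) < abs(x - c0) else 0 for x in test_x]
```

A call such as `monte_carlo_cv(xs, ys, nearest_centroid_1d)` returns the pooled accuracy, sensitivity and specificity; a permutation test would repeat the same call on randomly shuffled labels and compare the resulting accuracies with the observed one.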

Results
BC diagnosis occurred on average 8.6 and 8.2 years after blood sample collection in low- and high-MBD cases, respectively (p = 0.65). Mean age at diagnosis was significantly lower among high-MBD cases (62.9 and 59.8 years in low- and high-MBD cases, respectively, p = 0.002). High-MBD cases also showed a lower number of pregnancies and of breastfeeding months. Moreover, low-MBD cases were mainly post-menopausal women and women with a higher body mass index (Table 1).
Logistic models conditioned on the matching variables showed that 6 out of the 15 metabolites were inversely associated with high-MBD BC cases: alanine (OR 0.59, 95% CI 0.42 [...] (Table 3). Results did not change after further adjustment for waist/hip ratio, diabetes, hyperlipidaemia and hypertension (data not shown).
None of the examined molecules remained associated, in adjusted models, after controlling for multiple tests by FDR.
The results of the pathway analysis are presented graphically in Fig. 1. A total of 17 pathways were detected related to the 7 metabolites significantly associated with high-MBD cases in unadjusted logistic models conditioned on the matching variables. Two pathways emerged with an impact > 0.2. The first pathway was the "phenylalanine, tyrosine and tryptophan biosynthesis" (significant FDR adjusted p value), with 4 total compounds including 1 Hit corresponding to tyrosine. The second pathway was the "pyruvate metabolism" (non-significant FDR adjusted p value). This pathway included 22 total compounds, among which we had 3 Hits corresponding to pyruvic acid, lactic acid and acetic acid.
The PCA performed on the 87 case-case sets showed no outliers (Supplementary Fig. 1). Results of the supervised OPLS-DA for all the three types of NMR spectra (score plots: Fig. 2, loading plots: Supplementary Fig. 2) revealed a slight but significant separation between BC cases with high and low MBD (accuracy 61.2-62.6%, p-value < 0.05). The spectral regions that mainly contribute to the discrimination between high- and low-MBD women in the OPLS-DA models are those related to VLDL lipoproteins. Regarding low molecular weight metabolites, detected only in NOESY and CPMG spectra, the bins of alanine, valine, 3-hydroxybutyrate, pyruvate and lactate showed the highest discriminating power.

Discussion
This study aimed to evaluate the possible differential role of individual pre-diagnostic metabolomic profiles in two matched series of BC cases identified in participants with low or high MBD. Pre-diagnostic serum samples from BC cases were examined in the frame of a case-case study nested in the EPIC Florence cohort. Since all study subjects developed BC, the investigated association is not to be interpreted as a BC risk assessment related to pre-diagnostic serum metabolites, but rather as an estimation of the possible differential effect of pre-diagnostic metabolomic profiles in the modulation of the risk to develop a BC in women with high vs low MBD.
As expected, a lower mean age at diagnosis emerged among high-MBD BC cases compared to their matched low-MBD BC cases. A reduction in MBD with aging has been extensively reported in the literature 6 . The parameters associated with the reproductive history were also in line with the evidence reported in the literature 5-7 . High-MBD cases occurred mainly in pre-menopausal women and in women with a lower number of pregnancies.
As reported in literature 8 , in our study body mass index, body weight and waist and hip circumferences were significantly lower among high-MBD BC cases compared to low-MBD BC cases.
Results of the conditional logistic analyses showed an inverse association between serum levels of six metabolites and high-MBD BC cases, as well as for serum levels of the triglycerides lipid main fraction and 11 VLDL subfractions of triglycerides, cholesterol, phospholipids and APO B. One other metabolite was, on the other hand, directly associated with high-MBD BC cases. After adjustment for age at diagnosis, menopausal status, number of full-term pregnancies, breastfeeding, ER status and body mass index, tyrosine confirmed the significant inverse association with high-MBD BC cases and an association emerged between 3 VLDL subfractions of free cholesterol and high-MBD BC cases, although the reported associations lost significance after correction for multiple testing.
Alterations of amino acid levels in plasma or serum samples of breast cancer patients as compared with healthy controls were investigated by some studies with contradictory results 44 . Both tyrosine and alanine were not only found at higher levels in BC patients, but their levels also seem to be influenced by the stage of the disease 45 . We observed higher levels of alanine and tyrosine in pre-diagnostic serum samples of low-MBD cases in comparison to high-MBD cases, although in our series no differences emerged in tumor characteristics of low- and high-MBD matched BCs.
In our study three free cholesterol VLDL subfractions showed a direct association with high-MBD cases in adjusted models. Notably, our results remained significant in models adjusted for diabetes, hyperlipidaemia and hypertension. Obesity, overweight and dyslipidaemia are considered risk factors for BC, especially in postmenopausal women. However, the mechanisms in which they are involved, and therefore their role in BC development and growth, remain controversial, probably due to different experimental settings 46,47 . Clinical studies and meta-analyses support a role for obesity, dietary fat intake and cholesterol in the onset and progression of BC, while some studies show that high cholesterol levels prior to diagnosis protect against the development of these tumors 46 . In a previous in vitro study LDL subfractions 1 and 5, VLDL, but not HDL, enhanced BC cell viability, increased the in vitro tumorigenesis, promoted BC cell migration and invasion and promoted angiogenic activity. However, only VLDL promoted metastasis in nude mice 47 . Moreover, in a previous in vitro study, VLDL was associated with transport capacity of lipids to cancer cells to support breast cancer growth and development 48 .
Based on the pathway analysis, the "phenylalanine, tyrosine and tryptophan biosynthesis" emerged as the most important metabolic process showing a differential expression between low- and high-MBD BC cases, which persisted after FDR testing. In a paper from Chen et al. 49 , tyrosine metabolism emerged as one of the most relevant dysfunctional pathways in aggressive cancer cell lines and an interaction between cancer related pathways and tyrosine metabolism was reported. Our pathway analyses also showed a possible role of the "pyruvate pathway". Studies investigating the dysfunctional pathways that affect the progression of BC found "pyruvic metabolism" as one of the most closely involved. A series of differentially expressed genes such as ALDH2, ACACB and MDH1, contained in the "pyruvate metabolism" pathway, were down-regulated in BC samples 50,51 .
Table 2. Association between metabolites concentration, lipid main fractions concentration and high-MBD BC cases, compared to low-MBD BC cases, in the 87 sets (EPIC Florence, low- vs high-MBD BC case-case study).
a Odds ratios per standard deviation (SD) increase in metabolite concentration conditioned on age at cohort entry (± 5 years), type of mammographic examination (analogical/digital; negative/diagnostic) and year of mammographic examination (before/after 2000). Single metabolites and lipid main fractions separately added to the regression model.
b Odds ratios per standard deviation (SD) increase in metabolite concentration conditioned on age at cohort entry (± 5 years), type of mammographic examination (analogical/digital; negative/diagnostic) and year of mammographic examination (before/after 2000) and adjusted for age at diagnosis, number of full-term pregnancies, breastfeeding (yes/no), menopausal status at baseline, ER status, body mass index at baseline. Single metabolites and lipid main fractions separately added to the regression model.
c p values adjusted for false discovery rate (FDR) at α = 0.05 with Benjamini-Hochberg correction.
Finally, the untargeted supervised analyses performed on the serum metabolomic fingerprint revealed a slight but significant separation between BC cases diagnosed in women with high vs low MBD as assessed in a period preceding the diagnosis.
The main limitation of the current study is represented by the relatively modest sample size. Moreover, the smaller proportion of low-MBD women that were pre-menopausal and with a low BMI precluded the possibility to add menopausal status and BMI as criteria in the selection of the matched sets. However, menopausal status and BMI were considered as confounding variables in the adjusted models together with other variables strongly related to mammographic breast density, such as age at diagnosis, number of full-term pregnancies, breastfeeding and ER status. These adjusting variables strongly impacted the significance of the associations and, for some of the VLDL subfractions, also the direction of the associations. This strong impact was predictable since these characteristics are strongly related to mammographic breast density but also to the subjects' metabolic/lipid profile. On the other hand, our study has several strengths. First of all, the duration of the pre-diagnostic period between the sample collection and the BC diagnosis was very similar between the two matched series and sufficient to preclude any severe effect of BC on the metabolic profile of each individual study subject. Blood samples were collected and aliquoted according to standard operating procedures and have been stored in a dedicated liquid nitrogen biobank, thus ensuring a good preservation of the serum samples.
To our knowledge, this is the first study to examine the differences in the metabolomic profile of pre-diagnostic samples of BC cases diagnosed in women with high- and low-MBD, through a matched case-case design.
To date, biomarkers identified as differentially expressed in blood of patients of certain types of cancer have been mainly used before cancer diagnosis for risk assessment and screening, at diagnosis for classification and staging and after diagnosis in monitoring treatments or cancer recurrence 52,53 . Few of these biomarkers have been tested rigorously in pre-diagnostic serum collected from asymptomatic subjects. The utility of available biomarkers for diagnosis of early BC is currently unknown 54 . Few studies have been conducted on large patient cohorts using pre-diagnostic blood samples to investigate possible associations between metabolic biomarkers and breast cancer risk. A prospective nested case-control study was set up in the SU.VI.MAX cohort, including 206 breast cancer cases diagnosed during a 13-year follow-up and 396 matched controls. Untargeted NMR metabolomic profiles were established from baseline plasma samples. Women characterized by higher plasma levels of valine, lysine, arginine, glutamine, creatine, creatinine and glucose, and lower plasma levels of lipoproteins, lipids, glycoproteins, acetone, glycerol-derived compounds and unsaturated lipids had a higher risk of developing breast cancer 55 . On the other hand, many clinical studies, recently included in a systematic review 56 , investigated the metabolomic biomarkers and the pathways related to BC diagnosis. Among 22 studies performed on plasma or serum samples, tyrosine was the most frequently mentioned metabolite related to BC diagnosis, followed by other metabolites such as alanine and glycine. Pathway analyses highlighted the role of alanine, aspartate and glutamate metabolism in BC development while pyruvate metabolism emerged among pathways with high, although not significant, impact 56 . However, only few studies, mainly validation studies, used pre-diagnostic blood samples to investigate potential cancer biomarkers. Despite recent progress in the detection of low level biomarkers in pre-diagnostic BC samples, the small sample size of studies and the background technical/biological noise still represent a challenge 57 . This reinforces the need to conduct larger exploratory studies in pre-diagnostic samples.
Table 3. Association between lipoprotein subfractions concentration and high-MBD cases, compared to low-MBD BC cases, in the 87 sets (EPIC Florence, low- vs high-MBD BC case-case study).
a Odds ratios per standard deviation (SD) increase in metabolite concentration conditioned on age at cohort entry (± 5 years), type of mammographic examination (analogical/digital; negative/diagnostic) and year of mammographic examination (before/after 2000). Single lipoprotein subfractions separately added to the regression model.
b Odds ratios per standard deviation (SD) increase in metabolite concentration conditioned on age at cohort entry (± 5 years), type of mammographic examination (analogical/digital; negative/diagnostic) and year of mammographic examination (before/after 2000) and adjusted for age at diagnosis, number of full-term pregnancies, breastfeeding (yes/no), menopausal status, ER status, body mass index. Single lipoprotein subfractions separately added to the regression model.
c p values adjusted for false discovery rate (FDR) at α = 0.05 with Benjamini-Hochberg correction.
To conclude, in this case-case study aimed to identify metabolites differentially present in pre-diagnostic serum samples from high- or low-MBD women developing BC, a possible role for pre-diagnostic level of tyrosine [...]
Total Cmpd is the total number of compounds in the pathway; Hits is the actually matched number of compounds from the user uploaded data; Raw p is the original p value calculated from the enrichment analysis; Holm is the p value adjusted by Holm-Bonferroni method; FDR is the p value adjusted using False Discovery Rate; Impact is the pathway impact value calculated from pathway topology analysis (EPIC Florence, low- vs high-MBD BC case-case study).