
Deepfake audio detection with spectral features and ResNeXt-based architecture / Tahaoglu G.; Baracchi D.; Shullani D.; Iuliani M.; Piva A. - In: KNOWLEDGE-BASED SYSTEMS. - ISSN 0950-7051. - ELECTRONIC. - 323:(2025), art. 113726. [10.1016/j.knosys.2025.113726]

Deepfake audio detection with spectral features and ResNeXt-based architecture

Baracchi D.; Shullani D.; Iuliani M.; Piva A.
2025

Abstract

The increasing prevalence of deepfake audio technologies and their potential for malicious use in fields such as politics and media have raised significant concerns about the ability to distinguish fake from authentic audio recordings. This study proposes a robust technique for detecting synthetic audio by leveraging three spectral features: Linear Frequency Cepstral Coefficients (LFCC), Mel Frequency Cepstral Coefficients (MFCC), and Constant Q Cepstral Coefficients (CQCC). These features are processed by an enhanced ResNeXt architecture to improve classification accuracy between genuine and spoofed audio, and a Multi-Layer Perceptron (MLP)-based fusion technique is employed to further boost the model's performance. Extensive experiments were conducted on the ASVspoof 2019 Logical Access (LA) dataset, which features text-to-speech (TTS) and voice conversion attacks; the ASVspoof 2019 Physical Access (PA) dataset, which includes replay attacks; and the ASVspoof 2021 LA, PA, and DF datasets. The proposed approach demonstrated superior performance compared to state-of-the-art methods across all datasets, particularly in detecting fake audio generated by TTS attacks. The system achieved an Equal Error Rate (EER) of 1.05% and a minimum tandem Detection Cost Function (min-tDCF) of 0.028 on the ASVspoof 2019 LA dataset, and an EER of 1.14% and a min-tDCF of 0.03 on the ASVspoof 2019 PA dataset, demonstrating its robustness against various types of audio spoofing attacks. Finally, on the ASVspoof 2021 LA dataset the method achieved an EER of 7.44% and a min-tDCF of 0.35.
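The abstract reports results as Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). As a minimal illustration only — not the paper's evaluation code — the hypothetical `compute_eer` below estimates the EER from two lists of detector scores, assuming higher scores indicate genuine audio:

```python
def compute_eer(genuine_scores, spoof_scores):
    """Estimate the Equal Error Rate by sweeping over observed scores.

    Assumes higher scores mean "more likely genuine". Returns the mean
    of FAR and FRR at the threshold where their gap is smallest.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False acceptance rate: spoofed samples scoring at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: genuine samples scoring below t.
        frr = sum(g < t for g in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Perfectly separable scores yield an EER of 0.
print(compute_eer([0.9, 0.8, 0.7, 0.6], [0.4, 0.3, 0.2, 0.1]))  # → 0.0
```

This coarse sweep only considers observed scores as thresholds; evaluation toolkits such as the official ASVspoof scripts interpolate the DET curve and additionally compute the min-tDCF, which weighs spoofing errors by their cost to the tandem speaker-verification system.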
Files in this product:

File: 1-s2.0-S0950705125007725-main.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Creative Commons
Size: 2.03 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1432144
Citations
  • PMC: ND
  • Scopus: 2
  • Web of Science (ISI): 1