Evaluating the abilities of LLMs and SpeechLMs in discovering implicit contents of Italian political speeches / Lorenzo Gregori; Walter Paci; Alessandro Panunzi. - ELECTRONIC. - (2026), pp. 165-170.
Evaluating the abilities of LLMs and SpeechLMs in discovering implicit contents of Italian political speeches
Lorenzo Gregori; Walter Paci; Alessandro Panunzi
2026
Abstract
This research investigates the pragmatic competence of Large Language Models (LLMs) in interpreting implicit meanings within Italian political discourse. Using the IMPAQTS-PIDMM dataset, a multimodal benchmark derived from the 2.5-million-token IMPAQTS corpus, the experiment evaluates how effectively models identify tendentious content such as presuppositions and implicatures. The study compares text-only LLMs against speech-based models (SpeechLMs) that process both audio and transcriptions, to determine whether acoustic cues enhance understanding. The results reveal that text-only models significantly outperform multimodal variants, with Qwen2.5-72B achieving the highest global accuracy of 0.863. Surprisingly, the inclusion of audio did not improve performance: SpeechLMs such as GPT-4o-mini-audio-preview and Qwen2-Audio-7B-Instruct obtained lower accuracy scores and a higher frequency of missed answers than their text-only equivalents. Across all tested architectures, models generally handled presuppositions better than implicatures.



