Evaluating Models, Prompting Strategies, and Task Formats: a Case Study on the MACID Challenge / Matteo Rinaldi, Rossella Varvara, Lorenzo Gregori, Andrea Amelio Ravelli. - ELECTRONIC. - (2025), pp. 955-964.

Evaluating Models, Prompting Strategies, and Task Formats: a Case Study on the MACID Challenge.

Rossella Varvara;Lorenzo Gregori;Andrea Amelio Ravelli
2025

Abstract

In this study, we test the ability of eight Large Language Models to discriminate between closely related action concepts, based on either textual descriptions or video representations. Our aim is to understand whether these models can handle the fine-grained action understanding that humans perform with ease, particularly in cases of action-predicate mismatch, i.e., when the same verb describes different actions or different verbs refer to the same action. We experiment on the MACID dataset, a collection of actions representing "pushing" events, manually annotated with action IDs taken from the IMAGACT ontology. We evaluate how prompt complexity and task format influence models’ performance. Specifically, we test three different prompts with or without examples, two task formats (binary or multiple-choice), and two modalities (textual or visual). Results indicate that the binary task is not easier than the multiple-choice one, and that few-shot prompting generally improves models’ accuracy. Moreover, LLMs perform better when helped by lexical cues: accuracy increases when actions are expressed by different verbs and decreases when they are expressed by the same verb.
Year: 2025
Published in: Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)
Pages: 955-964
Authors: Matteo Rinaldi, Rossella Varvara, Lorenzo Gregori, Andrea Amelio Ravelli

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1477534