In this study, we test the ability of 8 Large Language Models to discriminate between closely related action concepts, based on either textual descriptions or video representations. Our aim is to understand whether these models can handle the fine-grained action understanding that humans perform with ease, particularly in cases of action-predicate mismatch, where the same verb describes different actions or different verbs refer to the same action. We experiment on MACID, a dataset of actions representing "pushing" events, manually annotated with action IDs taken from the IMAGACT ontology. We evaluate how prompt complexity and task format influence models’ performance. Specifically, we test three different prompts with or without examples, two task formats (binary or multiple-choice), and two modalities (textual or visual). Results indicate that the binary task is not easier than the multiple-choice one, and that few-shot prompting generally improves models’ accuracy. Moreover, LLMs perform better when helped by lexical cues: accuracy is higher when actions are expressed by different verbs and lower when they are expressed by the same verb.
Evaluating Models, Prompting Strategies, and Task Formats: a Case Study on the MACID Challenge / Matteo Rinaldi, R.V. - Electronic. - (2025), pp. 955-964.
Evaluating Models, Prompting Strategies, and Task Formats: a Case Study on the MACID Challenge.
Rossella Varvara;Lorenzo Gregori;Andrea Amelio Ravelli
2025



