MACID - Multimodal ACtion IDentification: A CALAMITA Challenge / Andrea Amelio Ravelli, Rossella Varvara, Lorenzo Gregori. - Electronic. - (2024). (CLiC-it 2024 - Italian Conference on Computational Linguistics, Pisa, Italy, December 4-6, 2024).
MACID - Multimodal ACtion IDentification: A CALAMITA Challenge
Andrea Amelio Ravelli; Rossella Varvara; Lorenzo Gregori
2024
Abstract
This paper presents the Multimodal ACtion IDentification challenge (MACID), part of the first CALAMITA competition. The objective of this task is to evaluate the ability of Large Language Models (LLMs) to differentiate between closely related action concepts based on textual descriptions alone. The challenge is inspired by the "find the intruder" task: models must identify an outlier among a set of four sentences that describe similar yet distinct actions. The dataset is composed of descriptions of "pushing" events and highlights action-predicate mismatches, where the same verb may describe different actions or different verbs may refer to the same action. Although currently mono-modal (text-only), the task is designed for future multimodal integration, linking visual and textual representations to enhance action recognition. By probing a model's capacity to resolve subtle linguistic ambiguities, the challenge underscores the need for deeper cognitive understanding in action-language alignment, ultimately testing the boundaries of LLMs' ability to interpret action verbs and their associated concepts.
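
To make the task format concrete, the following is a minimal Python sketch of how a single "find the intruder" item might be posed to an LLM. The example sentences, the prompt wording, and the build_intruder_prompt helper are illustrative assumptions, not drawn from the actual MACID dataset or evaluation harness.

    # Illustrative sketch of a "find the intruder" query in the style of MACID.
    # Sentences, prompt wording, and gold label are hypothetical examples.

    def build_intruder_prompt(sentences):
        """Format four action descriptions as a find-the-intruder question."""
        numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
        return (
            "Three of the following sentences describe the same action concept; "
            "one describes a different action. Answer with the number of the "
            "intruder only.\n" + numbered
        )

    # Hypothetical item: the same verb "push" covers distinct action concepts.
    item = [
        "The woman pushes the cart along the aisle.",   # moving an object by pushing
        "The man pushes the wheelbarrow up the hill.",  # moving an object by pushing
        "The boy pushes the sled across the snow.",     # moving an object by pushing
        "The girl pushes the button on the elevator.",  # pressing: a different concept
    ]
    gold_intruder = 4

    print(build_intruder_prompt(item))

    # A model's reply would then be parsed and scored against the gold label,
    # e.g. predicted = int(llm(prompt).strip()), with accuracy over all items.
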



