
MACID - Multimodal ACtion IDentification: A CALAMITA Challenge / Andrea Amelio Ravelli, Rossella Varvara, Lorenzo Gregori. - Electronic publication. - (2024). (CLiC-it 2024 - Italian Conference on Computational Linguistics, Pisa, Italy, December 4-6, 2024).

MACID - Multimodal ACtion IDentification: A CALAMITA Challenge

Andrea Amelio Ravelli; Rossella Varvara; Lorenzo Gregori
2024

Abstract

This paper presents the Multimodal ACtion IDentification challenge (MACID), part of the first CALAMITA competition. The objective of this task is to evaluate the ability of Large Language Models (LLMs) to differentiate between closely related action concepts based on textual descriptions alone. The challenge is inspired by the “find the intruder” task, where models must identify an outlier among a set of four sentences that describe similar yet distinct actions. The dataset is composed of “pushing” events, and it highlights action-predicate mismatches, where the same verb may describe different actions or different verbs may refer to the same action. Although currently mono-modal (text-only), the task is designed for future multimodal integration, linking visual and textual representations to enhance action recognition. By probing a model’s capacity to resolve subtle linguistic ambiguities, the challenge underscores the need for deeper cognitive understanding in action-language alignment, ultimately testing the boundaries of LLMs’ ability to interpret action verbs and their associated concepts.
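
The record does not specify the exact prompt or scoring protocol; the following is a minimal illustrative sketch of how the “find the intruder” setup described above could be run against an LLM, assuming a four-way multiple-choice prompt and a generic user-supplied llm callable. The function names, prompt wording, and example sentences are hypothetical and are not taken from the MACID dataset.

# Minimal sketch of a "find the intruder" item (illustrative only; not the MACID protocol).
from typing import Callable, List

def build_prompt(sentences: List[str]) -> str:
    """Format four action descriptions as a multiple-choice intruder question."""
    options = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    return (
        "Three of the following sentences describe the same action concept; "
        "one describes a different action. Reply with the number of the intruder.\n"
        + options + "\nAnswer:"
    )

def score_item(sentences: List[str], gold_index: int, llm: Callable[[str], str]) -> bool:
    """Query any prompt->text callable and check whether it picked the gold intruder."""
    reply = llm(build_prompt(sentences))
    digits = [c for c in reply if c.isdigit()]
    return bool(digits) and int(digits[0]) - 1 == gold_index

# Hypothetical example item: three sustained-pushing events and one pressing event,
# illustrating the same verb ("push") describing different actions.
item = [
    "The man pushes the cart across the parking lot.",
    "She pushes the wheelbarrow along the garden path.",
    "He pushes the trolley towards the exit.",
    "The child pushes the button of the elevator.",  # intruder
]
# correct = score_item(item, gold_index=3, llm=my_model)  # my_model: user-supplied LLM wrapper

In this sketch the model is asked for the option number of the intruder and the first digit in its reply is compared against the gold index; the actual challenge may use a different answer format or scoring rule.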
Year: 2024
Published in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Conference: CLiC-it 2024 - Italian Conference on Computational Linguistics, Pisa, Italy, December 4-6, 2024
Authors: Andrea Amelio Ravelli, Rossella Varvara, Lorenzo Gregori


Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1426452