Vision and language tasks: applications to real scenarios and image quality assessment
Pietro Bongini
2023
Abstract
The human brain has always been one of the most fascinating fields of study. The first theories and research results about machine learning date back around fifty years, but only in the last few years, thanks to increasing computational power, have these theories been put into practice, with applications in fields such as autonomous driving, human-computer interaction, and medical imaging, among many others. Perception is perhaps the most important way humans understand the physical world, and language is how humans communicate their experiences. For this reason, the integration of vision and language has been gaining attention, and language-aligned visual features have proven effective for vision-language tasks. These tasks have recently received significant attention from the Artificial Intelligence community; however, many of them are far from solved and require further research. In this dissertation, we focus on three vision and language tasks: Visual Question Answering (VQA), Image Captioning (IC), and Cross-Modal Retrieval (CMR).

Visual Question Answering systems are capable of answering visual questions (that is, questions referring to the semantic content of images), but a significant limitation is their inability to answer contextual questions (that is, questions that refer to image content but require external information to be answered). For this reason, we investigate the use of external knowledge in support of answer generation. In the first part of this thesis, we propose two approaches to handle and extract external textual information and improve VQA in the Cultural Heritage domain, where external information is crucial. Moreover, we propose a data collection and annotation technique, as well as a large dataset for VQA in the Cultural Heritage domain.

In the second part of this thesis, we investigate the application of Image Captioning to Image Quality Assessment (IQA), the task of evaluating the perceptual quality of images. IQA approaches are severely limited by the lack of training data. After preliminary work on generative data augmentation, we propose a novel approach that exploits image captioning to infer quality scores in both No-Reference and Full-Reference scenarios.

Finally, Cross-Modal Retrieval approaches rank images given a text query (and vice versa) at a merely descriptive level, focusing on which objects appear in the image and how many there are. To address this limitation, in the last part of this thesis we propose a new architecture that exploits scene text to improve cross-modal retrieval performance on multiple datasets that vary in the percentage of scene-text images and in the type of caption (contextual or visual).