Leveraging deep scene features for vision tasks: from classification to forecasting / Andrea Ciamarra. - (2024).

Leveraging deep scene features for vision tasks: from classification to forecasting

Andrea Ciamarra
2024

Abstract

Deep scene features constitute an essential means for various vision tasks, as they represent the perceptual content of the scene at multiple levels of detail learned through deep learning. Some of these features capture the global context of the environment, i.e. an overall understanding of the scene, such as the main content depicted in an image, or the geometry, textures and categories that identify an entire scenario or situation. Other features capture local context at a finer level of detail, for instance the structure of objects, their spatial positioning and how they relate to each other. Deep scene features therefore grasp several characteristics of the environment at different granularities, which is crucial for vision tasks. Scene understanding, for instance, is a well-known problem in the literature that aims at semantic reasoning about the contents of a scene or an environment, parsing low-level elements, e.g. textures, shapes and colors, up to higher-level concepts, e.g. objects, surfaces and relationships among parts of the scene, all of which can be represented through specialised deep features. Deep Convolutional Neural Networks (DCNNs) have achieved outstanding results on scene understanding for the current image or video frame, in terms of object detection, depth estimation, motion estimation and so forth. However, this is not sufficient to reason correctly about future scenarios. Owing to its high uncertainty, future estimation remains an open and only partially solved problem. Existing forecasting models are trained to make predictions at a given unseen frame, discarding the intermediate future representations that could support downstream tasks such as motion planning for driving. More generally, deep scene features can be extracted to thoroughly analyse content, which also plays an important role in several forensic tasks. In the DeepFake Detection (DFD) task, specialised deep features need to be learned from the scene in order to reveal whether a piece of content is real or tampered with. Neural networks have proven able to detect fake images with extraordinary results, mainly by exploiting the RGB space. Nowadays, generative models and forgery techniques make digital fakes hard for humans to distinguish, and the task is becoming very challenging even for neural networks. However, there are other aspects of the image one can examine in order to learn synthetic patterns or inconsistencies in fake content. This line of research is not yet fully explored in the literature, as new forgeries are invented every day to fool classifiers, especially in the RGB space.

In this thesis, we first present a forecasting framework that anticipates high-level semantic content in the scene, i.e. it predicts moving objects relying on optical flow as the sole source of information. To this end, a unified architecture employs deep scene features extracted from the predicted optical flow and the current semantic context in order to move present object masks onto future frames, as sketched below. In the second part, we devise a novel model that tackles the forecasting problem from another point of view, i.e. by predicting geometric and spatio-temporal low-level details of the scene, through a multimodal approach in a multitask learning fashion.
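To make the mask-propagation step of the first framework concrete, here is a minimal sketch of warping an object mask onto a future frame with a predicted optical flow field, assuming PyTorch tensors and bilinear grid sampling; the function name and warping details are illustrative and not taken from the thesis implementation.

    import torch
    import torch.nn.functional as F

    def warp_mask(mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) binary object mask at the current frame.
        # flow: (B, 2, H, W) predicted backward flow (future -> current), in pixels.
        b, _, h, w = mask.shape
        # Identity sampling grid in pixel coordinates.
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=torch.float32),
            torch.arange(w, dtype=torch.float32),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
        # Displace each pixel by the flow, then normalise to [-1, 1] for grid_sample.
        coords = grid + flow
        x_norm = 2.0 * coords[:, 0] / (w - 1) - 1.0
        y_norm = 2.0 * coords[:, 1] / (h - 1) - 1.0
        sample_grid = torch.stack((x_norm, y_norm), dim=-1)  # (B, H, W, 2)
        warped = F.grid_sample(mask, sample_grid, mode="bilinear", align_corners=True)
        return (warped > 0.5).float()  # re-binarise the propagated mask

In practice the flow field would come from the flow-forecasting stage, and the same sampling grid can be reused for every object mask in the frame.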
This model addresses existing limitations in forecasting tasks by jointly predicting optical flow and depth for multiple future frames, including all the intermediate predictions. The autoregressive architecture demonstrates that sharing information between the two tasks is beneficial and helps mitigate the error propagation of single-timestep predictive models. In the last part, we focus on the Deepfake context. In particular, we cast the face forgery detection problem as a classification task whose objective is to detect inconsistencies in a novel feature space that takes the surface properties of the scene into account. We thus address the problem from a different perspective and propose an innovative Deepfake detection method based on alterations of geometric properties related to the data acquisition process. This approach highlights the importance of considering the environment captured by the camera during content acquisition, extracting deep features that describe subtle surface details meaningful for detecting fakes. Minimal sketches of the autoregressive forecaster and of the surface-based classifier follow.
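As an illustration of the autoregressive multitask scheme, the following is a minimal sketch assuming a shared convolutional encoder with separate flow and depth heads; the layer sizes and module names are placeholders, not the architecture proposed in the thesis.

    import torch
    import torch.nn as nn

    class FlowDepthForecaster(nn.Module):
        def __init__(self, feat_ch: int = 64):
            super().__init__()
            # Shared backbone: both tasks read the same scene features.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            )
            self.flow_head = nn.Conv2d(feat_ch, 2, 3, padding=1)   # (u, v) motion
            self.depth_head = nn.Conv2d(feat_ch, 1, 3, padding=1)  # per-pixel depth

        def forward(self, x: torch.Tensor, steps: int):
            # Roll the model forward `steps` times, feeding each joint
            # prediction back as input, so every intermediate future frame
            # is produced, not only the last one.
            outputs = []
            for _ in range(steps):
                feats = self.encoder(x)
                flow = self.flow_head(feats)
                depth = self.depth_head(feats)
                outputs.append((flow, depth))
                x = torch.cat((flow, depth), dim=1)  # 2 + 1 = 3 input channels
            return outputs

Returning the whole list of (flow, depth) pairs, rather than only the final step, is what exposes the intermediate predictions to downstream tasks.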
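For the detection part, a minimal sketch of the classification setup follows, assuming torchvision and assuming the input frame has already been converted into a per-pixel surface descriptor map (imagined here as a three-channel normal map); the descriptor choice and the ResNet-18 backbone are illustrative assumptions, not the exact pipeline of the thesis.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    def build_surface_classifier() -> nn.Module:
        # A standard backbone repurposed for real-vs-fake classification,
        # fed with surface descriptor maps instead of raw RGB frames.
        net = models.resnet18(weights=None)
        net.fc = nn.Linear(net.fc.in_features, 2)  # logits: real / fake
        return net

    # The input stands in for a (B, 3, H, W) descriptor of subtle surface
    # details, e.g. an estimated normal map rescaled to the input range.
    model = build_surface_classifier()
    logits = model(torch.randn(4, 3, 224, 224))  # placeholder surface-map batch

The point of the sketch is only that the classifier operates in the surface feature space rather than in the RGB space.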
Supervisors: Alberto Del Bimbo, Lorenzo Seidenari, Federico Becattini
Country: Italy
Files in this item:

Leveraging_Deep_Scene_Features_for_Vision_tasks__from_Classification_to_Forecasting.pdf
  Access: open access
  Type: Doctoral thesis
  License: Open Access
  Size: 19.58 MB
  Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1354311