
Object and action annotation in visual media beyond categories / Federico Becattini. - (2018).

Object and action annotation in visual media beyond categories

Federico Becattini
2018

Abstract

The steady growth of applications involving artificial intelligence and machine learning signals an imminent large-scale diffusion of intelligent agents, both robotic and software-based, throughout our society. In particular, advances in computer vision are now more important than ever to give these agents a degree of awareness of the environment they act in. In this thesis we work toward this goal, first tackling the need to detect objects in images at a fine-grained instance level, and then moving to the video domain, where we learn to discover unknown entities and to model the behavior of an important subset of them: humans. The first part of this thesis is dedicated to the image domain, for which we propose a taxonomy-based technique to speed up an ensemble of instance-based Exemplar-SVM classifiers. Exemplar-SVMs have been used in the literature to tackle object detection tasks while simultaneously transferring semantic labels to the detections, at a cost that is linear in the number of training samples. Our method allows these classifiers to be evaluated with a sub-logarithmic dependence on the number of samples, yielding speed gains of up to 100x for large ensembles. We also demonstrate the application of similar image-analysis techniques in a real-world scenario: an Android app for the Museo Novecento in Florence, Italy, which recognizes paintings in the museum and transfers their artistic styles to personal photos. Transitioning to videos, we propose an approach that discovers objects in an unsupervised fashion by exploiting the temporal consistency of frame-wise object proposals. Relying only minimally on the visual content of the frames, we generate spatio-temporal tracks that contain generic objects and that can serve as a preliminary step for processing a video sequence. Lastly, driven by the intuition that humans should be the focus of attention in video understanding, we introduce the problem of modeling the progress of human actions at frame-level granularity. Besides knowing when someone is performing an action and where that person is in each frame, we believe that predicting how far an ongoing action has progressed will help an intelligent agent interact with the surrounding environment and with the human performing the action. To this end we propose ProgressNet, a Recurrent Neural Network based model that jointly predicts the spatio-temporal extent of an action and how far it has progressed during its execution. Experiments on the challenging UCF101 and J-HMDB datasets demonstrate the effectiveness of our method.
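As an illustration of the temporal-consistency idea mentioned above, the following is a minimal sketch in Python of how per-frame object proposals could be greedily linked into spatio-temporal tracks by overlap between consecutive frames. The box format, the IoU threshold, the greedy matching strategy and the helper names (iou, link_proposals) are assumptions made for this example only; it is a hypothetical stand-in, not the algorithm actually developed in the thesis.

# Illustrative sketch only: greedy IoU-based linking of frame-wise object
# proposals into spatio-temporal tracks. Box format, threshold and matching
# strategy are assumptions for this example, not the thesis method.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_proposals(frames, iou_threshold=0.5):
    """Link per-frame proposal boxes into tracks by temporal overlap.

    frames: list over time, each item a list of [x1, y1, x2, y2] proposals.
    Returns a list of tracks, each a list of (frame_index, box) pairs.
    """
    active, finished = [], []
    for t, proposals in enumerate(frames):
        unmatched = list(proposals)
        still_active = []
        for track in active:
            last_box = track[-1][1]
            # Extend the track with the best-overlapping proposal, if any.
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= iou_threshold:
                track.append((t, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no consistent continuation: close it
        # Every unmatched proposal starts a new candidate track.
        still_active.extend([[(t, box)] for box in unmatched])
        active = still_active
    return finished + active

if __name__ == "__main__":
    # Two frames: one object that persists, one spurious proposal.
    frames = [
        [[10, 10, 50, 50], [200, 200, 220, 220]],
        [[12, 11, 52, 49]],
    ]
    for track in link_proposals(frames):
        print(track)

A track here is simply a list of (frame index, box) pairs; in the example, two proposals appear in the first frame, only one persists into the second, and the spurious one is closed as a one-frame track.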
Supervisors: Alberto Del Bimbo, Lorenzo Seidenari, Marco Bertini
Country: Italy
Files in this record:
File: phd-thesis-Federico-Becattini_compressed.pdf
Access: open access
Description: Doctoral thesis
Type: Doctoral thesis
License: Open Access
Size: 7.05 MB
Format: Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1121033