
Object and action annotation in visual media beyond categories / Federico Becattini. - (2018).

Object and action annotation in visual media beyond categories

Federico Becattini
2018

Abstract

The steady growth of applications involving artificial intelligence and machine learning signals an imminent large-scale diffusion of intelligent agents, both robotic and software-based, throughout our society. In particular, advances in computer vision are now more important than ever to give these agents a degree of awareness of the environment they act in. In this thesis we work toward this goal, first tackling the need to detect objects in images at a fine-grained instance level, and then moving to the video domain, where we learn to discover unknown entities and to model the behavior of an important subset of them: humans. The first part of this thesis is dedicated to the image domain, for which we propose a taxonomy-based technique to speed up an ensemble of instance-based Exemplar-SVM classifiers. Exemplar-SVMs have been used in the literature to tackle object detection tasks while simultaneously transferring semantic labels to the detections, at a cost that is linear in the number of training samples. Our method allows these classifiers to be evaluated with a sub-logarithmic dependence on the number of samples, yielding speed gains of up to 100x for large ensembles. We also demonstrate the application of similar image-analysis techniques in a real-world scenario: an Android app for the Museo Novecento in Florence, Italy, which recognizes paintings in the museum and transfers their artistic styles to personal photos. Transitioning to videos, we propose an approach that discovers objects in an unsupervised fashion by exploiting the temporal consistency of frame-wise object proposals. Relying only minimally on the visual content of the frames, we generate spatio-temporal tracks that contain generic objects and that can serve as a preliminary step for processing a video sequence. Lastly, driven by the intuition that humans should be the focus of attention in video understanding, we introduce the problem of modeling the progress of human actions at frame-level granularity. Besides knowing when someone is performing an action and where that person is in each frame, we believe that predicting how far an ongoing action has progressed will help an intelligent agent interact with the surrounding environment and with the human performing the action. To this end we propose ProgressNet, a Recurrent Neural Network based model that jointly predicts the spatio-temporal extent of an action and how far it has progressed during its execution. Experiments on the challenging UCF101 and J-HMDB datasets demonstrate the effectiveness of our method.
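As an illustration of the temporal-consistency idea mentioned above, the following is a minimal sketch in Python of how per-frame object proposals could be greedily linked into spatio-temporal tracks by overlap between consecutive frames. The box format, the IoU threshold, the greedy matching strategy and the helper names (iou, link_proposals) are assumptions made for this example only; it is a hypothetical stand-in, not the algorithm actually developed in the thesis.

# Illustrative sketch only: greedy IoU-based linking of frame-wise object
# proposals into spatio-temporal tracks. Box format, threshold and matching
# strategy are assumptions for this example, not the thesis method.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_proposals(frames, iou_threshold=0.5):
    """Link per-frame proposal boxes into tracks by temporal overlap.

    frames: list over time, each item a list of [x1, y1, x2, y2] proposals.
    Returns a list of tracks, each a list of (frame_index, box) pairs.
    """
    active, finished = [], []
    for t, proposals in enumerate(frames):
        unmatched = list(proposals)
        still_active = []
        for track in active:
            last_box = track[-1][1]
            # Extend the track with the best-overlapping proposal, if any.
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= iou_threshold:
                track.append((t, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no consistent continuation: close it
        # Every unmatched proposal starts a new candidate track.
        still_active.extend([[(t, box)] for box in unmatched])
        active = still_active
    return finished + active

if __name__ == "__main__":
    # Two frames: one object that persists, one spurious proposal.
    frames = [
        [[10, 10, 50, 50], [200, 200, 220, 220]],
        [[12, 11, 52, 49]],
    ]
    for track in link_proposals(frames):
        print(track)

A track here is simply a list of (frame index, box) pairs; in the example, two proposals appear in the first frame, only one persists into the second, and the spurious one is closed as a one-frame track.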
Supervisors: Alberto Del Bimbo, Lorenzo Seidenari, Marco Bertini
Country: Italy
Files in this record:
File: phd-thesis-Federico-Becattini_compressed.pdf
Access: open access
Description: Doctoral thesis
Type: Doctoral thesis
License: Open Access
Size: 7.05 MB
Format: Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1121033