Learning over time from a continuous video stream / Lapo Faggi. - (2023).

Learning over time from a continuous video stream

Lapo Faggi
2023

Abstract

Whenever we encounter a novel visual scene, we quickly form an impression of its content, recognizing objects as coherent wholes despite the fragmentary nature of the raw sensory information we perceive, made up of pieces of different shapes, textures and colours. This fundamental cognitive capacity develops in the earliest stages of human and animal life, with little or no supervision, through continuous visual observation and interaction with the environment. However, the prevailing learning paradigms used to solve computer vision tasks often overlook the time dimension, relying instead on vast collections of labelled, independently and identically distributed static images that are processed offline with computational resources available only to a few. To date, there is a lack of models able to continuously learn from a single, possibly never-ending stream of visual information, capitalizing on its inherent temporal coherence and retaining important information without forgetting past experiences. In this dissertation, we thoroughly investigate some of the issues that arise in such a learning scenario, in which the time dimension is directly involved. To filter out irrelevant information and foster the development of object-centred representations, we consider the role of visual spatial attention. In particular, we propose a novel spatially local and temporally coherent model of the focus of attention that proves extremely effective in simulating gaze patterns on videos. We show how to integrate a given attention model into classical convolutional layers, obtaining a foveated computational mechanism that is effective both in terms of the learned representations and of the computational resources it requires. Temporal coherence, the distinctive property of video streams, is also proposed as a fundamental aspect that should be reflected in every meaningful visual representation. This property is fostered through an unsupervised criterion that enforces the invariance of visual features with respect to the corresponding velocity fields (motion invariance). In a different formulation, temporal coherence can also be enforced along the trajectory of a given focus of attention; together with the spatial uniformity promoted within the attended moving object, this confirms the role of such consistency principles in tackling the large sample complexity usually required to train modern computer vision architectures. Finally, we theoretically ground the notion of learning over time for models that continuously evolve and adapt in accordance with the temporal dynamics dictated by the environment. In this context, considering recurrent-like models, we offer a novel interpretation of the learning process, defined in a principled way in terms of optimal control techniques, in which the parameters of the model act as control variables that steer its dynamics according to a predefined optimality criterion.
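To make the foveated mechanism concrete, the sketch below shows one plausible way to gate a convolutional layer with a focus-of-attention (FOA) signal: the layer's response is attenuated by a Gaussian mask centred on the current FOA. This is a minimal illustrative PyTorch sketch, not the implementation described in the thesis; the names foa_mask and FoveatedConv2d, the Gaussian mask shape, and the parameter sigma are all assumptions, and a real foveated layer would restrict computation to the attended region to save resources rather than merely attenuating the output, as done here.

# Hypothetical sketch of an attention-gated convolution; names and the
# Gaussian form are illustrative assumptions, not the thesis's API.
import torch
import torch.nn as nn

def foa_mask(height, width, foa_xy, sigma=20.0):
    """Gaussian mask centred on the current focus of attention (x, y)."""
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    d2 = (xs - foa_xy[0]) ** 2 + (ys - foa_xy[1]) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))  # (H, W), values in (0, 1]

class FoveatedConv2d(nn.Module):
    """A standard conv layer whose response decays away from the FOA."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x, foa_xy):
        y = self.conv(x)                                # (B, C, H, W)
        m = foa_mask(y.shape[-2], y.shape[-1], foa_xy)  # (H, W)
        return y * m                                    # broadcast over B, C

# Usage: one video frame, attention currently at pixel (120, 80)
frame = torch.randn(1, 3, 224, 224)
layer = FoveatedConv2d(3, 16)
out = layer(frame, foa_xy=(120.0, 80.0))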
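The motion-invariance criterion also admits a compact formalization; the following is a plausible reading of the abstract, with symbols of our choosing rather than the thesis's exact notation. If f(x, t) denotes the feature field computed on the video and v(x, t) the associated velocity (optical-flow) field, invariance of the features along the motion amounts to a vanishing material derivative, which can be promoted in an unsupervised way by penalizing its squared norm:

\frac{Df}{Dt} \;=\; \frac{\partial f}{\partial t} + \big(\nabla f\big)\, v \;=\; 0,
\qquad
\mathcal{L}_{\mathrm{inv}} \;=\; \int \Big\| \frac{\partial f}{\partial t}(x,t) + \big(\nabla f(x,t)\big)\, v(x,t) \Big\|^{2} \, dx \, dt .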
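Likewise, the optimal-control interpretation of learning over time can be sketched in generic terms; this is a standard finite-horizon formulation consistent with the abstract, in which the dynamics φ, the running cost ℓ and the horizon T are our assumptions, not the thesis's notation. The model state x(t) evolves under the input u(t) coming from the stream, the parameters w(t) play the role of control variables, and learning amounts to choosing w so as to minimize an integral cost; Pontryagin-style conditions on the associated Hamiltonian then characterize the optimal parameter trajectories:

\dot{x}(t) = \varphi\big(x(t), w(t), u(t)\big), \quad x(0) = x_0,
\qquad
J(w) = \int_0^T \ell\big(x(t), w(t), t\big)\, dt,

H(x, w, p, t) = \ell(x, w, t) + p^{\top} \varphi\big(x, w, u(t)\big),
\qquad
\dot{p}(t) = -\,\partial_x H, \quad \partial_w H = 0 .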
Year: 2023
Supervisor: Marco Gori
Country: Italy
Author: Lapo Faggi
Files in this record:
  • PhD_Thesis_Lapo_Faggi.pdf (open access)
    Description: Learning Over Time from a Continuous Video Stream
    Type: Publisher's PDF (Version of record)
    License: Creative Commons
    Size: 47.23 MB
    Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1322253