Opening machine eyes over time: input tuning and motion-driven learning / Marullo Simone. - (2024).

Opening machine eyes over time: input tuning and motion-driven learning

Marullo Simone
2024

Abstract

Vision is not just the biological ability to detect light; it is an essential part of the capability of animals, humans, and future machines to interpret, understand and act in the environment. If a 2-year-old child encounters their very first tractor while hearing its name, from that point forward the child will recognize all varieties of tractors without confusing them with cars or trucks. To date, this remarkable talent for visual learning, acquired with so little supervision from external agents, is not easily reproducible in computer vision. Inspired by the quest to achieve similar learning schemes, in this work we study several aspects of computer vision, proposing innovative neural network training techniques. The first part of the thesis introduces the concept of input tuning for smooth learning paths: dynamic transformations of the inputs during training, inspired by the gradual visual skill acquisition observed in infants. We present a method that breaks complex learning tasks down into a series of incrementally challenging sub-tasks, achieved through input transformations matched to the learner's skill level, enhancing model performance and deepening our understanding of the learning process. We then apply the notion of input tuning in a different scenario, where a learner faces diverse tasks without a meaningful order and risks catastrophic forgetting. A novel training method keeps the learner's core static while using learnable transformations in the input space to adapt to the environment, mitigating forgetting in realistic situations. The second part of the thesis shifts away from the supervised focus of the first, aiming to create autonomous visual agents that learn directly from their surroundings without human intervention. These agents forgo large labelled data collections, observing continuous video streams and learning online, with motion as the primary source of information. To this end, we start by investigating optical flow estimation in dynamic environments with a purely online unsupervised approach. We then present two self-supervised learning techniques: the first employs an attention trajectory, simulating human visual attention and allowing agents to establish semantic connections among pixels; the second is motion-based, resulting from a layered autonomous development process. Our results indicate significant progress in the quest for autonomous visual skill development and leave intriguing directions open. The benefits obtained from controlling the learning pace through input tuning naturally point to future research aimed at improving the robustness of visual agents that learn online without supervision.
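To make the input-tuning idea concrete, here is a minimal sketch assuming a blur-based input transformation whose intensity anneals as the learner improves; the transformation, the loss-threshold schedule, and all names (tune_input, difficulty) are illustrative assumptions, not the thesis's actual method.

```python
# Hypothetical sketch of "input tuning": inputs pass through a
# transformation that is relaxed as the learner's skill grows, breaking
# the task into incrementally challenging sub-tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

def tune_input(x: torch.Tensor, difficulty: float) -> torch.Tensor:
    """Blend each image toward a heavily blurred copy.

    difficulty = 0.0 gives the easiest (most blurred) version of the
    input; difficulty = 1.0 restores the original image.
    """
    blurred = F.interpolate(F.avg_pool2d(x, 4), size=x.shape[-2:])
    return difficulty * x + (1.0 - difficulty) * blurred

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
difficulty = 0.0  # start from the easiest sub-task

for step in range(1000):
    x = torch.randn(16, 3, 32, 32)        # stand-in for a real batch
    y = torch.randint(0, 10, (16,))
    loss = F.cross_entropy(model(tune_input(x, difficulty)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Raise difficulty only once the learner copes with the current
    # sub-task, keeping the transformation matched to its skill level.
    if loss.item() < 1.0:
        difficulty = min(1.0, difficulty + 0.01)
```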
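The frozen-core idea for mitigating forgetting can likewise be sketched as a fixed backbone paired with lightweight, learnable transformations in the input space; the per-environment convolutional adapters below are a hypothetical reading of the abstract, not the method itself.

```python
# Hypothetical sketch of a static core with learnable input-space
# adapters: only the adapter for the current environment is trained,
# so adaptation cannot overwrite knowledge stored in the shared core.
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
for p in backbone.parameters():
    p.requires_grad_(False)   # the learner's core stays static

# One lightweight learnable transformation per environment (illustrative).
adapters = nn.ModuleDict({
    "env_a": nn.Conv2d(3, 3, 3, padding=1),
    "env_b": nn.Conv2d(3, 3, 3, padding=1),
})

def forward(x, env: str):
    # Gradients flow only into the current environment's adapter.
    return backbone(adapters[env](x))
```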
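For the unsupervised optical flow part, a common label-free objective is photometric reconstruction with a smoothness regulariser; the sketch below shows that standard formulation only as a plausible illustration of how flow can be learned online without supervision, since the abstract does not specify the actual objective used.

```python
# Standard unsupervised flow objective: warp frame t+1 by the predicted
# flow, penalise the mismatch with frame t, and keep the flow smooth.
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B,C,H,W) with pixel-space `flow` (B,2,H,W)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    coords = torch.stack((xs, ys), dim=-1) + flow.permute(0, 2, 3, 1)
    gx = 2.0 * coords[..., 0] / (w - 1) - 1.0   # normalise to [-1, 1]
    gy = 2.0 * coords[..., 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1),
                         align_corners=True)

def unsupervised_flow_loss(frame_t, frame_t1, flow, beta: float = 0.1):
    # Photometric term: the warped next frame should match the current one.
    photometric = (warp(frame_t1, flow) - frame_t).abs().mean()
    # Smoothness term: penalise abrupt spatial changes in the flow field.
    smooth = (flow[..., 1:, :] - flow[..., :-1, :]).abs().mean() \
           + (flow[..., :, 1:] - flow[..., :, :-1]).abs().mean()
    return photometric + beta * smooth
```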
Supervisor: Marco Gori
Country: Italy
Files in this product:

File: PhD_SimoneMarullo.pdf (open access)
Type: Publisher's PDF (Version of record)
License: Creative Commons
Size: 11.82 MB
Format: Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1355786