Opening machine eyes over time: input tuning and motion-driven learning / Marullo Simone. - (2024).
Abstract
Vision is not just the biological ability to detect light; it is an essential part of the capability of animals, humans, and future machines to interpret, understand and act in their environment. If a 2-year-old child encounters their very first tractor while hearing its name, from that point forward the child will recognize tractors of all varieties, without confusing them with cars or trucks. To date, this surprising talent in visual learning, acquired with such limited supervision from external agents, is not easily reproduced in computer vision. Inspired by the quest to achieve similar learning schemes, in this work we study several aspects of computer vision, proposing innovative neural network training techniques.

The first part of the thesis introduces the concept of input tuning for smooth learning paths, which involves dynamic transformations of inputs during training, inspired by the gradual visual skill acquisition observed in infants. We present a method that breaks down complex learning tasks into a series of incrementally challenging sub-tasks. This is achieved through input transformations that match the learner's skill level, enhancing model performance and deepening our understanding of the learning process. We then apply the notion of input tuning in a different scenario, where a learner faces diverse tasks without a meaningful order and risks catastrophic forgetting. A novel training method keeps the learner's core static while using learnable transformations in the input space for environment adaptation, mitigating forgetting in realistic situations.

The second part of the thesis shifts from the supervised learning focus of the first, aiming to create autonomous visual agents that learn directly from their surroundings without human intervention. These agents forgo large labelled data collections, observing continuous video streams and learning online, with motion as the primary source of information. Accordingly, we start by investigating optical flow estimation in dynamic environments, using a purely online unsupervised approach. We then present two self-supervised learning techniques. The first employs an attention trajectory, simulating human visual attention and allowing agents to establish semantic connections among pixels. The second is motion-based, resulting from a layered autonomous development process. Results indicate significant progress in the quest for autonomous visual skill development, with intriguing open directions. The benefits obtained from controlling the learning pace through input tuning naturally open up future research directions aimed at improving the robustness of visual agents that learn online without supervision.
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| PhD_SimoneMarullo.pdf | Open access | Editorial PDF (Version of Record) | Creative Commons | 11.82 MB | Adobe PDF |
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.