

Anthropomorphous Visual Recognition: Learning with Weak Supervision, with Scarce Data, and Incrementally over Transient Tasks

Del Chiaro, Riccardo
2021

Abstract

Over the last eight years the field of computer vision has experienced dramatic improvements thanks to the widespread availability of data and of affordable parallel computing hardware like GPUs. These two factors have made it possible to train very deep neural network models in reasonable time using millions of labeled examples for supervision. Humans do not learn concepts in this way: we do not need a massive number of labeled examples to learn new concepts, but instead rely on a few (or even zero) examples, infer missing information, and generalize. Moreover, we retain previously learned concepts without the need to re-train. We can easily ride a bicycle after years of not doing so, or recognize an elephant even though we may not have seen one recently. These characteristics of human learning stand in stark contrast to how deep models learn: they require massive amounts of labeled training data due to their overparameterization, they have limited generalization capabilities, and they easily forget previously learned tasks or concepts when trained on new ones. These shortcomings limit the applicability of deep learning in the scenarios where they are most evident. In this thesis we study several such scenarios and propose strategies to overcome some of the negative aspects of deep neural network training. We retain the gradient-based learning paradigm, but adapt it to address some of these differences between human learning and learning in deep networks. Our goal is to achieve better learning characteristics and to improve performance in specific applications.

We first study the artwork instance recognition problem, for which it is very difficult to collect large collections of labeled images. Our proposed approach relies on web search engines to collect examples, which gives rise to two related problems: domain shift due to biases in search engines, and noisy supervision. We propose several strategies to mitigate these problems. To better mimic the human ability to learn from compact semantic descriptions of tasks, we then propose a zero-shot learning strategy for recognizing never-before-seen artworks that relies solely on textual descriptions of the target artworks.

Next we look at the problem of learning from scarce data in no-reference image quality assessment (NR-IQA), an application for which data is notoriously scarce due to the high cost of annotation. Humans have an innate ability to inductively generalize from a limited number of examples; to better mimic this, we propose a generative model that produces controlled perturbations of the input image, with the goal of synthetically increasing the number of training instances used to train the network that estimates input image quality.

Finally, we focus on the problem of catastrophic forgetting in recurrent neural networks, using image captioning as the problem domain. We propose two strategies for defining continual image captioning experimental protocols, and we develop a continual learning framework for image captioning models based on encoder-decoder architectures. A task is defined by a set of object categories that appear in the images the model should be able to describe. We observe that catastrophic forgetting is even more pronounced in this setting, and we establish several baselines by adapting existing state-of-the-art techniques to our continual image captioning problem.
To mimic the human ability to retain and leverage past knowledge when acquiring new tasks, we then propose a mask-based technique that allocates specific neurons to each task and applies the masks only during backpropagation. In this way, novel tasks do not interfere with previous ones and forgetting is avoided. At the same time, past knowledge is exploited because the network can still use neurons allocated to previous tasks during the forward pass, which in turn reduces the number of neurons needed to learn each new task.
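To make the masking mechanism concrete, below is a minimal PyTorch-style sketch of the general idea (per-unit task ownership, gradients zeroed for units owned by earlier tasks, forward pass over all units allocated so far). The class, its ownership bookkeeping, and the allocation API are illustrative assumptions for a single linear layer, not the implementation developed in the thesis.

```python
import torch
import torch.nn as nn


class MaskedLinear(nn.Linear):
    """Linear layer whose output units are allocated to tasks.

    Forward pass: every unit allocated to the current or an earlier
    task is active, so past knowledge is reused.
    Backward pass: gradient hooks zero the updates for units owned by
    earlier tasks, so learning a new task cannot overwrite them.
    """

    def __init__(self, in_features, out_features):
        super().__init__(in_features, out_features)
        # owner[j] = index of the task that unit j is allocated to
        # (-1 means not yet allocated). Illustrative bookkeeping.
        self.register_buffer(
            "owner", torch.full((out_features,), -1, dtype=torch.long)
        )
        self.current_task = 0
        # Mask gradients during backpropagation only.
        self.weight.register_hook(self._mask_weight_grad)
        self.bias.register_hook(self._mask_bias_grad)

    def _trainable(self):
        # Units owned by the current task (or still free) may be updated.
        return (self.owner == self.current_task) | (self.owner == -1)

    def _mask_weight_grad(self, grad):
        # weight has shape (out_features, in_features): mask whole rows.
        return grad * self._trainable().float().unsqueeze(1)

    def _mask_bias_grad(self, grad):
        return grad * self._trainable().float()

    def allocate(self, task_id, units):
        """Assign the output units in `units` to task `task_id`."""
        self.current_task = task_id
        self.owner[units] = task_id

    def forward(self, x):
        out = super().forward(x)
        # All units allocated up to the current task stay active,
        # so previous-task features remain usable.
        active = (self.owner >= 0) & (self.owner <= self.current_task)
        return out * active.float()
```

Usage, under the same assumptions:

```python
layer = MaskedLinear(8, 4)
layer.allocate(task_id=0, units=[0, 1])  # task 0 owns units 0 and 1
# ... train on task 0 ...
layer.allocate(task_id=1, units=[2])     # task 1 owns unit 2
# Units 0 and 1 are now frozen by the gradient hooks, yet the forward
# pass still uses them, so task 1 can build on task 0's features.
```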
Supervisors: Andrew David Bagdanov, Lorenzo Seidenari
Files in this record:

File: thesis Del Chiaro 07-05-21.pdf
Description: Thesis
Type: Doctoral thesis
License: Open Access
Size: 28.34 MB
Format: Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1238101