Language Models for Text Understanding and Generation / Andrea Zugarini. - (2021).

Language Models for Text Understanding and Generation

Andrea Zugarini
2021

Abstract

The ability to understand and generate language is one of the most fascinating and peculiar traits of humankind. We can discuss facts, events, stories, or the most abstract aspects of our existence with other individuals only because of the power and expressiveness of language. Natural Language Processing (NLP) studies the intriguing properties of language, its rules, its evolution, its connections with semantics, knowledge and generation, and tries to harness these features in automatic processes. Language is built upon a collection of symbols, and meaning emerges from their composition. This symbolic nature limited the development of Machine Learning solutions for NLP for many years: many systems relied on rule-based methods, and Machine Learning models were based on handcrafted, task-specific features. In the last decade there have been remarkable advances in the study of language, thanks to the combination of Deep Learning models with Transfer Learning techniques. Deep models can learn from huge amounts of data and have proven particularly effective at learning feature representations automatically from high-dimensional input spaces. Transfer Learning techniques aim at reducing the need for data by reusing representations learned from related tasks. In NLP, this is possible thanks to Language Modeling: tasks related to Language Modeling are essential for building general-purpose representations from large unlabelled textual corpora, enabling the shift from symbolic to sub-symbolic representations of language. In this thesis, motivated by the need to move toward unified NLP agents capable of understanding and generating language in a human-like fashion, we address several NLP challenges, proposing solutions based on Language Modeling. In summary, we develop a character-aware neural language model to learn general-purpose word and context representations, and use its encoder to tackle several language understanding problems, including an agent for the extraction of entities and relations from an online stream of text. We then focus on Language Generation, addressing two different problems, Paraphrasing and Poem Generation: in the former, generation is tied to the information given in input, whereas in the latter the production of text requires creativity. Finally, we also show how language models can aid the analysis of language varieties, proposing a new perplexity-based indicator to measure distances between diachronic or dialectal language varieties.
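As a rough illustration of the perplexity-based idea mentioned above, the following minimal Python sketch (our own example, not the indicator actually developed in the thesis) trains a character-level bigram model with add-one smoothing on one language variety, evaluates its perplexity on another, and symmetrizes the two cross-perplexities into a single distance score:

import math
from collections import Counter

def train_bigram(corpus):
    # Character-level unigram/bigram counts over a list of strings.
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        chars = ["<s>"] + list(text)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams, set(unigrams)

def perplexity(model, corpus):
    # Perplexity of `corpus` under the add-one-smoothed bigram `model`.
    unigrams, bigrams, vocab = model
    log_prob, n = 0.0, 0
    for text in corpus:
        chars = ["<s>"] + list(text)
        for prev, cur in zip(chars, chars[1:]):
            # +1 in the denominator also reserves mass for unseen symbols.
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab) + 1)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

def pp_distance(corpus_a, corpus_b):
    # Symmetrized perplexity-based distance between two language varieties:
    # train on each variety, evaluate on the other, and average.
    return 0.5 * (perplexity(train_bigram(corpus_a), corpus_b)
                  + perplexity(train_bigram(corpus_b), corpus_a))

# Toy usage: distance between two tiny "varieties" of English.
old = ["thou art", "whither goest thou"]
new = ["you are", "where are you going"]
print(pp_distance(old, new))

Averaging the two directions makes the score symmetric; the indicator proposed in the thesis may differ in the underlying language model, granularity and normalization.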
Year: 2021
Supervisor: Marco Maggini
Country: Italy
Author: Andrea Zugarini
Files in this record:

PhD_Thesis_AZ.pdf (open access)
Description: Doctoral thesis
Type: Doctoral thesis
License: Open Access
Size: 2.33 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1238004