Towards the Creation of a Diachronic Corpus for Italian: a Case Study on the GDLI Quotations / Manuel Favaro, Elisa Guadagnini, Eva Sassolini, Marco Biffi, Simonetta Montemagni. - Electronic. - (2022), pp. 94-100. (Paper presented at the Language Resources and Evaluation Conference (LREC 2022), held in Marseille on 25 June 2022.)

Towards the Creation of a Diachronic Corpus for Italian: a Case Study on the GDLI Quotations

Marco Biffi; 2022

Abstract

This paper describes experiments on a corpus derived from an authoritative historical Italian dictionary, the 'Grande dizionario della lingua italiana' ('Great Dictionary of the Italian Language', GDLI for short). Thanks to the digitization and structuring of this dictionary, we were able to set up the first nucleus of a diachronic annotated corpus. The corpus selects, according to specific criteria and distinguishing between prose and poetry, some of the quotations that illustrate the different definitions and sub-definitions within the dictionary entries. The GDLI contains a huge collection of quotations covering the entire history of the Italian language, ranging from the Middle Ages to the present day. The corpus was enriched with linguistic annotation and used to train and evaluate NLP models for POS tagging and lemmatization, with promising results.
2022
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2022)
Language Resources and Evaluation Conference (LREC 2022).
Marseille, 25 June 2022
Manuel Favaro, Elisa Guadagnini, Eva Sassolini, Marco Biffi, Simonetta Montemagni


Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1293168