Word classes and corpus linguistics

Gregori, Lorenzo; Paci, Walter; Moneglia, Massimo

doi:10.1515/9783110746389-030

This chapter discusses the relevance of part-of-speech (PoS) tagsets for disambiguating word occurrences and presents the main computational linguistics (CL) standards for annotating word classes in Romance corpora. We highlight the availability of corpora and CL tools for studying the quantitative distribution of PoS in language usage, demonstrating the feasibility of this perspective for Romance languages. In both the written and the spoken variety, open-class words turn out to be similarly distributed in Italian, Spanish, Portuguese, and French, while their quantitative variation depends mainly on linguistic register. Moreover, quantitative trends in the relative frequency of PoS emerge in the open-class lexicon for both language varieties. We examine how contextual needs influence the relative frequency of word classes. Lastly, we discuss some linguistic phenomena observed in spoken corpora, which show that open challenges remain for PoS-tagging algorithms.

Word classes and corpus linguistics / Lorenzo Gregori, Walter Paci, Massimo Moneglia. - Print. - (2024), pp. 769-796. [10.1515/9783110746389-030]