
Combining natural language processing and machine learning for profiling and fake news detection / Alessandro Bondielli. - (2021).

Combining natural language processing and machine learning for profiling and fake news detection

Alessandro Bondielli
2021

Abstract

In recent years, Natural Language Processing (NLP) and Text Mining have become increasingly active fields of research, driven in part by advances in Deep Learning and Language Models that make it possible to tackle several interesting and novel problems in different application domains. Traditional text mining techniques mostly relied on structured data to design machine learning algorithms. However, a growing number of online platforms contain a wealth of unstructured information that represents great value both for Industry, especially in the context of Industry 4.0, and for Public Administration services, e.g. for smart cities. This holds especially true for social media, where the production of user-generated data is growing rapidly. Such data can be exploited to great benefit for several purposes, including profiling, information extraction, and classification. User-generated texts can in fact provide crucial insight into the interests, skills, and mindset of their authors, and can enable the comprehension of wider phenomena such as how information spreads through the internet. The goal of the present work is twofold. Firstly, several case studies demonstrate how a mixture of NLP and Text Mining approaches, and in particular the notion of distributional semantics, can be successfully exploited to model different kinds of profiles based purely on unstructured textual information. First, city areas are profiled from newspaper articles by means of word embeddings and clustering, categorizing the areas according to their tags. Second, experiments are performed using distributional representations (i.e., embeddings) of entire sequences of text. Several techniques, including traditional methods and Language Models, aimed at profiling professional figures based on their résumés are proposed and evaluated.
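The profiling approach sketched in the abstract, representing texts as averaged word embeddings and grouping them by clustering, can be illustrated with a minimal example. This is not the thesis code: the tiny hand-made word vectors, the averaging scheme, and the bare-bones k-means below are illustrative assumptions standing in for pretrained embeddings and a standard clustering library.

```python
import numpy as np

# Toy word-embedding lookup. In practice pretrained vectors
# (e.g. word2vec or fastText) would be used; these 2-d vectors
# are purely illustrative.
word_vectors = {
    "museum":   np.array([0.9, 0.1]),
    "gallery":  np.array([0.8, 0.2]),
    "traffic":  np.array([0.1, 0.9]),
    "roadwork": np.array([0.2, 0.8]),
}

def embed(text):
    """Represent a text as the average of its word embeddings."""
    vecs = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Four toy "newspaper snippets" about two kinds of city areas.
texts = ["museum gallery", "gallery museum museum",
         "traffic roadwork", "roadwork traffic traffic"]
X = np.vstack([embed(t) for t in texts])
labels = kmeans(X, k=2)
# Culture-related snippets and traffic-related snippets fall into
# different clusters.
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```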
Secondly, these key concepts and insights are applied to the challenging and open tasks of fake news detection and fact-checking, in order to build models capable of distinguishing between trustworthy and untrustworthy information. The proposed method exploits the semantic similarity of texts: an architecture based on state-of-the-art language models for semantic textual similarity and classification is proposed to perform fact-checking. The approach is evaluated on real-world data containing fake news. To collect and label the data, a methodology is proposed that includes both real/fake news and a ground truth; this framework addresses the problems of data collection and annotation of fake news, also by exploiting fact-checking techniques. In light of the obtained results, the advantages and shortcomings of approaches based on distributional text embeddings are discussed, as is the effectiveness of the proposed system for detecting fake news by exploiting factually correct information. The proposed method is shown to be a viable alternative to a traditional classification-based approach to fake news detection.
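The core idea of similarity-based fact-checking described above, comparing a claim against a set of verified facts and accepting it only when it is close enough to one of them, can be sketched as follows. The sentence vectors, the example facts, and the 0.8 threshold are all illustrative assumptions: in the actual setting these embeddings would come from a sentence encoder (e.g. a Transformer fine-tuned for semantic textual similarity).

```python
import numpy as np

# Toy "sentence embeddings" for a small knowledge base of verified facts.
verified_facts = {
    "The vaccine was approved after clinical trials.": np.array([0.9, 0.1, 0.4]),
    "The city election takes place in June.":          np.array([0.1, 0.9, 0.3]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fact_check(claim_vec, threshold=0.8):
    """Compare a claim embedding to every verified fact.

    Returns the best-matching fact and whether the claim is considered
    supported (similarity above the threshold) under this toy model.
    """
    scores = {fact: cosine(claim_vec, v) for fact, v in verified_facts.items()}
    best_fact = max(scores, key=scores.get)
    return best_fact, scores[best_fact] >= threshold

# A claim vector close to the first fact, and one unrelated to both.
supported = fact_check(np.array([0.88, 0.15, 0.38]))
unrelated = fact_check(np.array([-0.2, -0.1, 0.9]))
print(supported[1], unrelated[1])
```

The design choice here mirrors the abstract's point: instead of training a binary fake/real classifier, the claim is checked against factually correct information, so the decision is grounded in the closest verified statement.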
Francesco Marcelloni
ITALIA
Files in this record:

PhDThesis_AlessandroBondielli.pdf
Access: open access
Description: Doctoral thesis of Alessandro Bondielli
Type: Doctoral thesis
Licence: All rights reserved
Size: 6.19 MB
Format: Adobe PDF
Documents in FLORE are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1244287