Episode detection and mining in real world applications / Pietro Dell'Oglio. - (2024).
Episode detection and mining in real world applications
Pietro Dell'Oglio
2024
Abstract
In data mining, an episode is defined as a subsequence of events or symbols extracted from a single sequence of events or symbols. Frequent Episode Mining is the field of data mining that searches for episodes that occur frequently in a sequence over time, following a specific temporal order. In this thesis, we also consider episodes in the domains of Natural Language Processing (NLP) and Text Mining. In particular, we deal with text episodes, defined as text subsequences in an extended text sequence, and interesting text episodes, which are text episodes that meet particular conditions relevant to the application domain. We propose different approaches for mining interesting episodes in three distinct case studies. For each study, we first present the solution and then discuss the results obtained.

The first case study is the News Collector, a system developed to help readers automatically gather comprehensive information about a news event from various sources. We propose a methodology based on transformer neural models, text summarization models, and decision rules to highlight information divergence between texts. In this context, the “text sequence” is a set of articles on the same topic as a reference article already read by the user. The “interesting text episode” to mine is a piece of information represented by a set of sentences that form a subsequence of non-redundant information within the larger sequence of articles. This subsequence is summarized and returned to the user.

The second case study is a framework for mining frequent subsequences of actions from logs that record the interaction of operators with specific applications. The system aims to automatically spot repetitive subsequences of actions that are candidates for automation. This framework was developed within the AUTOMIA project and has also been used to investigate how Episode Mining techniques can be employed in conjunction with a specific similarity metric.

In the third case study, we investigate unverified information and its dissemination on the web. Here, an “interesting text episode” is defined as an article or a social media post classified as fake within a sequence of articles or posts. We introduce a methodology for collecting and labelling both authentic and fake news, while also establishing ground truth for real-world events. This methodology is then applied to two specific events: the 2019 Notre-Dame fire and the more recent Russia-Ukraine war. We employ the concept of information divergence to detect fake news: when a piece of information begins to diverge from that reported by trusted sources, it is likely fake.

File | Type | License | Size | Format |
---|---|---|---|---|
thesis_phd__smart_computing_delloglio.pdf (open access) | Publisher's PDF (Version of record) | Open Access | 5.68 MB | Adobe PDF |
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.