The rapid growth of the World Wide Web and the number of resources available online represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely expressed in unstructured natural language. Single-machine, non-distributed architectures are proving inefficient for automatically ingesting and processing such large amounts of data in tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and their computational demands have increased significantly, requiring solutions such as distributed frameworks and parallel programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. This has been achieved by integrating the APIs of the widely used open source GATE NLP platform into a multi-node cluster built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.
A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents / Nesi, Paolo; Pantaleo, Gianni; Sanesi, Gianmarco. - Electronic. - 2015:(2015), pp. 155-161. (Paper presented at the 21st Int. Conf. on Distributed Multimedia Systems (DMS'2015), held at the Hyatt Regency, Vancouver, Canada, August 31 - September 2, 2015) [10.18293/DMS2015-024].
A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents
NESI, PAOLO;PANTALEO, GIANNI;
2015