A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents

Nesi, Paolo; Pantaleo, Gianni; Sanesi, Gianmarco

doi:10.18293/DMS2015-024

The rapid growth of the World Wide Web and the increasing number of resources available online represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely expressed in unstructured natural language. Single-machine, non-distributed architectures are proving inefficient at automatically ingesting and processing such huge amounts of data for tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and their computational power requirements have increased significantly, calling for solutions such as distributed frameworks and parallel programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. This has been achieved by integrating the APIs of the widely used GATE open source NLP platform into a multi-node cluster built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.
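
The parallel processing idea summarized in the abstract can be illustrated with a minimal map/reduce-style sketch. It is not the authors' implementation and uses neither GATE nor Hadoop: a thread pool stands in for cluster nodes, a plain frequency count stands in for NLP-based keyword extraction, and every name in it (`STOPWORDS`, `map_document`, `reduce_counts`, `extract_keywords`) is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import re

# Hypothetical stopword list; a real NLP pipeline would rely on
# linguistic analysis (e.g. POS tagging) rather than a fixed set.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is", "are", "on"}


def map_document(text):
    """Map step: count candidate keywords in a single document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)


def reduce_counts(partials):
    """Reduce step: merge per-document counts into corpus-wide totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def extract_keywords(documents, top_n=3, workers=2):
    """Run the map step concurrently, then reduce and rank the results."""
    # Assumption: threads stand in for the nodes of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_document, documents))
    return reduce_counts(partials).most_common(top_n)


docs = [
    "Distributed frameworks process text in parallel.",
    "Parallel text processing needs distributed storage.",
    "Keyword extraction from text documents.",
]
top_keywords = extract_keywords(docs)
print(top_keywords)
```

Each document is processed independently in the map step, which is what makes this kind of workload easy to spread across nodes; only the final merge needs to see all partial results.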

A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents / Nesi, P., Pantaleo, G., Sanesi, G. - Electronic. - 2015:(2015), pp. 155-161. (21st Int. Conf. on Distributed Multimedia Systems (DMS'2015), Hyatt Regency, Vancouver, Canada, August 31 - September 2, 2015) [10.18293/DMS2015-024].