A Hadoop based platform for natural language processing of web pages and documents

Nesi, Paolo; Pantaleo, Gianni; Sanesi, Gianmarco

doi:10.1016/j.jvlc.2015.10.017

The rapid and pervasive spread of information through the web (public and private data, both human-authored and automatically generated) has produced a huge amount of unstructured natural language text. Over the last decade, great interest has arisen in discovering, accessing and sharing this vast source of knowledge. As a result, processing huge amounts of data in a reasonable time frame has become both a major challenge and a crucial requirement for many commercial and research application areas. Distributed systems, computer clusters and parallel computing paradigms have been increasingly applied in recent years, since they significantly improve computing performance in data-intensive contexts such as Big Data mining and analysis. Natural Language Processing (NLP) is an application area that can benefit from parallel architectures. This paper presents a distributed framework for crawling web documents and executing NLP tasks in parallel. The system is based on the Apache Hadoop platform and its parallel programming paradigm, MapReduce. Specifically, we implemented a MapReduce adaptation of a GATE application (GATE is a widely used open source tool for text engineering and NLP) for extracting keywords and keyphrases from web documents on a multi-node Hadoop cluster. Performance scalability was evaluated against a real corpus of web pages and documents.
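
The MapReduce pattern named in the abstract can be illustrated with a minimal, single-process sketch. This is not the authors' implementation: it uses neither Hadoop nor GATE, and it replaces keyword and keyphrase extraction with a plain stopword-filtered term count. The function names (`map_document`, `shuffle`, `reduce_counts`, `run_job`), the stopword list and the two-document corpus are all hypothetical, chosen only to show how a map phase, a grouping step and a reduce phase fit together.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real NLP pipeline such as GATE is far richer.
STOPWORDS = {"the", "a", "of", "and", "in", "for", "is", "to"}


def map_document(doc_id, text):
    """Map phase: emit (candidate_term, 1) for every non-stopword token."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOPWORDS:
            yield token, 1


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: sum the occurrences of one candidate term."""
    return key, sum(values)


def run_job(corpus):
    """Run map, shuffle and reduce sequentially over a {doc_id: text} corpus."""
    pairs = [
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    ]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


# Toy corpus, invented for illustration only.
corpus = {
    "doc1": "Hadoop is a platform for parallel processing",
    "doc2": "Parallel processing of web documents in Hadoop",
}
counts = run_job(corpus)
```

In a cluster setting, each call to the map function would run on the node holding that document, and the framework would perform the grouping step across nodes; the sketch runs everything in one process to keep the data flow visible.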

A Hadoop based platform for natural language processing of web pages and documents / Nesi, P., Pantaleo, G., Sanesi, G. - In: JOURNAL OF VISUAL LANGUAGES AND COMPUTING. - ISSN 1045-926X. - ELECTRONIC. - 31:(2015), pp. 130-138. [10.1016/j.jvlc.2015.10.017]