A Hadoop based platform for natural language processing of web pages and documents / Nesi, Paolo; Pantaleo, Gianni; Sanesi, Gianmarco. - In: JOURNAL OF VISUAL LANGUAGES AND COMPUTING. - ISSN 1045-926X. - ELECTRONIC. - 31:(2015), pp. 130-138. [10.1016/j.jvlc.2015.10.017]

A Hadoop based platform for natural language processing of web pages and documents

Nesi, Paolo; Pantaleo, Gianni; Sanesi, Gianmarco
2015

Abstract

The rapid and extensive spread of information through the web (in the form of public and private data, including both human and automatically generated content) has made available a huge amount of unstructured natural language text. Over the last decade, great interest has arisen in discovering, accessing and sharing this vast source of knowledge. As a result, processing huge amounts of data within a reasonable time frame is becoming a major challenge, as well as a crucial requirement in many commercial and research application areas. Distributed systems, computer clusters and parallel computing paradigms have been increasingly applied in recent years, since they significantly improve computing performance in data-intensive contexts such as Big Data mining and analysis. Natural Language Processing (NLP) is one application area that can benefit from parallel architectures. This paper presents a distributed framework for crawling web documents and running Natural Language Processing tasks in parallel. The system is based on the Apache Hadoop platform and its parallel programming paradigm, MapReduce. Specifically, we implemented a MapReduce adaptation of a GATE application (GATE is a widely used open-source tool for text engineering and NLP) that extracts keywords and keyphrases from web documents on a multi-node Hadoop cluster. Performance scalability was evaluated against a real corpus of web pages and documents.
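
To illustrate the map, shuffle and reduce steps that the abstract refers to, the following is a minimal in-process sketch in Python. It is not the authors' implementation: it uses neither Hadoop nor GATE, and it replaces keyword and keyphrase extraction with plain term-frequency counting. The function names, the stopword list and the sample corpus are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from"}


def map_document(doc_id, text):
    """Map step: emit a (term, 1) pair for every candidate keyword in one document."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if len(token) > 2 and token not in STOPWORDS:
            yield token, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(term, counts):
    """Reduce step: sum the partial counts collected for one term."""
    return term, sum(counts)


def top_keywords(corpus, k=3):
    """Run map, shuffle and reduce over a {doc_id: text} corpus and return the k most frequent terms."""
    pairs = [
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    ]
    reduced = [reduce_counts(term, counts) for term, counts in shuffle(pairs).items()]
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(reduced, key=lambda item: (-item[1], item[0]))[:k]


if __name__ == "__main__":
    sample = {
        "page-a": "Hadoop runs MapReduce jobs",
        "page-b": "MapReduce jobs on Hadoop clusters scale; Hadoop scales",
    }
    print(top_keywords(sample))
```

On a real cluster, each call to the map function would run on a separate input split and the grouping step would be carried out by the framework. Here everything runs in a single process, purely to show the data flow.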
Files in this item:
File: 1-s2.0-S1045926X14001165-main.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Open Access
Size: 2.12 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1017351