Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems and at the same time scales linearly with the amount of resources available. This paper aims at filling this gap, through the description of BUbiNG, our next-generation web crawler built upon the authors' experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is an open-source Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousands pages per second respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols so to achieve very high throughput.

BUbiNG: Massive Crawling for the Masses / Paolo Boldi; Andrea Marino; Massimo Santini; Sebastiano Vigna. - In: ACM TRANSACTIONS ON THE WEB. - ISSN 1559-1131. - STAMPA. - (2018), pp. 1-26.

BUbiNG: Massive Crawling for the Masses

Andrea Marino;
2018

Abstract

Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems and at the same time scales linearly with the amount of resources available. This paper aims at filling this gap, through the description of BUbiNG, our next-generation web crawler built upon the authors' experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is an open-source Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousands pages per second respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols so to achieve very high throughput.
2018
1
26
Paolo Boldi; Andrea Marino; Massimo Santini; Sebastiano Vigna
File in questo prodotto:
File Dimensione Formato  
1601.06919.pdf

Accesso chiuso

Tipologia: Pdf editoriale (Version of record)
Licenza: Tutti i diritti riservati
Dimensione 4.37 MB
Formato Adobe PDF
4.37 MB Adobe PDF   Richiedi una copia

I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificatore per citare o creare un link a questa risorsa: https://hdl.handle.net/2158/1149243
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 22
  • ???jsp.display-item.citation.isi??? 17
social impact