We study node similarity in labeled networks, using the label sequences found in paths of bounded length q leading to the nodes. (This recalls the q-grams employed in document resemblance, based on the Jaccard distance.) When applied to networks, the challenge is two-fold: the number of q-grams generated from labeled paths grows exponentially with q, and their frequency should be taken into account: this leads to a variation of the Jaccard index known as Bray-Curtis index for multisets. We describe nSimGram, a suite of fast algorithms for node similarity with q-grams, based on a novel blend of color coding, probabilistic counting, sketches, and string algorithms, where the universe of elements to sample is exponential. We provide experimental evidence that our measure is effective and our running times scale to deal with large real-world networks.

Node Similarity with q-Grams for Real-World Labeled Networks / Alessio Conte; Gaspare Ferraro; Roberto Grossi; Andrea Marino; Kunihiko Sadakane; Takeaki Uno. - STAMPA. - (2018), pp. 1282-1291. (Intervento presentato al convegno Proceedings of the 24th ACM SIGKDD International Conference onKnowledge Discovery & Data Mining, KDD 2018) [10.1145/3219819.3220085].

Node Similarity with q-Grams for Real-World Labeled Networks

Roberto Grossi;Andrea Marino;
2018

Abstract

We study node similarity in labeled networks, using the label sequences found in paths of bounded length q leading to the nodes. (This recalls the q-grams employed in document resemblance, based on the Jaccard distance.) When applied to networks, the challenge is two-fold: the number of q-grams generated from labeled paths grows exponentially with q, and their frequency should be taken into account: this leads to a variation of the Jaccard index known as Bray-Curtis index for multisets. We describe nSimGram, a suite of fast algorithms for node similarity with q-grams, based on a novel blend of color coding, probabilistic counting, sketches, and string algorithms, where the universe of elements to sample is exponential. We provide experimental evidence that our measure is effective and our running times scale to deal with large real-world networks.
2018
Proceedings of the 24th ACM SIGKDD International Conference onKnowledge Discovery & Data Mining, KDD 2018, London, UK, August19-23, 2018
Proceedings of the 24th ACM SIGKDD International Conference onKnowledge Discovery & Data Mining, KDD 2018
Alessio Conte; Gaspare Ferraro; Roberto Grossi; Andrea Marino; Kunihiko Sadakane; Takeaki Uno
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificatore per citare o creare un link a questa risorsa: https://hdl.handle.net/2158/1149232
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 10
  • ???jsp.display-item.citation.isi??? 9
social impact