Document similarity self-join with MapReduce

Ranieri Baraglia, Gianmarco Morales, Claudio Lucchese

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

61 Citations (Scopus)

Abstract

Given a collection of objects, the Similarity Self-Join problem requires discovering all pairs of objects whose similarity is above a user-defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and from MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real-world data shows that our algorithm outperforms the state of the art by a factor of 4.5.
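To make the problem statement concrete, the following is a minimal in-memory sketch of a similarity self-join organized as two map/reduce-style rounds: one builds an inverted index, the other sums partial products per document pair. It is not the algorithm proposed in the paper. It assumes cosine similarity over non-empty sparse term-weight vectors (the abstract does not name the similarity measure), applies no pruning or partitioning, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def normalize(doc):
    """Scale a sparse term -> weight vector to unit length (assumes it is non-empty)."""
    norm = sqrt(sum(w * w for w in doc.values()))
    return {term: w / norm for term, w in doc.items()}


def map_index(doc_id, doc):
    """Round 1 map: emit (term, (doc_id, weight)) for every nonzero term."""
    for term, weight in doc.items():
        yield term, (doc_id, weight)


def reduce_pairs(postings):
    """Round 1 reduce: for one term, emit a partial dot product per document pair."""
    for (a, wa), (b, wb) in combinations(sorted(postings), 2):
        yield (a, b), wa * wb


def similarity_self_join(docs, threshold):
    """Return {(doc_a, doc_b): cosine} for all pairs with cosine >= threshold."""
    docs = {doc_id: normalize(doc) for doc_id, doc in docs.items()}

    # Simulated shuffle 1: group postings by term, i.e. build an inverted index.
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        for term, posting in map_index(doc_id, doc):
            index[term].append(posting)

    # Simulated shuffle 2: group partial products by pair and sum them.
    scores = defaultdict(float)
    for postings in index.values():
        for pair, partial in reduce_pairs(postings):
            scores[pair] += partial

    # Keep only pairs at or above the user-defined threshold.
    return {pair: s for pair, s in scores.items() if s >= threshold}
```

Because only documents that share a term ever meet in a posting list, sparse collections generate far fewer candidate pairs than the full quadratic comparison. This is the property that pruning strategies of the kind mentioned in the abstract build on.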

Original language: English
Title of host publication: Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
Pages: 731-736
Number of pages: 6
DOI: https://doi.org/10.1109/ICDM.2010.70
Publication status: Published - 2010
Externally published: Yes
Event: 10th IEEE International Conference on Data Mining, ICDM 2010 - Sydney, NSW
Duration: 14 Dec 2010 to 17 Dec 2010

Other

Other: 10th IEEE International Conference on Data Mining, ICDM 2010
City: Sydney, NSW
Period: 14/12/10 to 17/12/10

Keywords

  • MapReduce
  • Similarity Self-Join
  • Web information retrieval

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Baraglia, R., Morales, G., & Lucchese, C. (2010). Document similarity self-join with MapReduce. In Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010 (pp. 731-736). [5694030] https://doi.org/10.1109/ICDM.2010.70