Document similarity self-join with MapReduce

Ranieri Baraglia, Gianmarco Morales, Claudio Lucchese

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

58 Citations (Scopus)

Abstract

Given a collection of objects, the Similarity Self-Join problem requires discovering all pairs of objects whose similarity is above a user-defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real-world data shows that our algorithm outperforms the state of the art by a factor of 4.5.
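The problem statement in the abstract can be illustrated with a small single-process sketch. This is not the algorithm proposed in the paper: it omits the pruning, the partitioning strategy, and the use of a distributed file system, and only mimics the map, shuffle, and reduce steps in memory. Cosine similarity over term-weight dictionaries, the function names, and the sample documents are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def normalize(vec):
    """Scale a sparse term-weight vector (a dict) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def map_phase(docs):
    """Map: emit (term, (doc_id, weight)) for every term of every document."""
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            yield term, (doc_id, weight)


def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def similarity_self_join(docs, threshold):
    """Return {(id_a, id_b): cosine} for pairs scoring at least `threshold`.

    Assumption: the comparison is inclusive (>=); the abstract only says
    "above a user-defined threshold".
    """
    unit = {doc_id: normalize(vec) for doc_id, vec in docs.items()}
    postings = shuffle(map_phase(unit))
    scores = defaultdict(float)
    # Reduce: each posting list adds one partial dot product per document pair.
    for plist in postings.values():
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            scores[(a, b)] += wa * wb
    return {pair: s for pair, s in scores.items() if s >= threshold}


# Hypothetical toy collection: d1 and d2 share two terms, d3 shares none.
docs = {
    "d1": {"map": 1.0, "reduce": 1.0},
    "d2": {"map": 1.0, "reduce": 1.0, "join": 1.0},
    "d3": {"graph": 1.0},
}
result = similarity_self_join(docs, threshold=0.8)
```

Because only documents that share a term ever meet in a posting list, pairs with no common term are never scored. That is the sparseness the abstract refers to, although the paper's pruning strategies go further than this sketch.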

Original language: English
Title of host publication: Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
Pages: 731-736
Number of pages: 6
DOIs: https://doi.org/10.1109/ICDM.2010.70
Publication status: Published - 2010
Externally published: Yes
Event: 10th IEEE International Conference on Data Mining, ICDM 2010 - Sydney, NSW
Duration: 14 Dec 2010 to 17 Dec 2010

Fingerprint

Parallel algorithms
Scalability
Data storage equipment
Communication

Keywords

  • MapReduce
  • Similarity Self-Join
  • Web information retrieval

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Baraglia, R., Morales, G., & Lucchese, C. (2010). Document similarity self-join with MapReduce. In Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010 (pp. 731-736). [5694030] https://doi.org/10.1109/ICDM.2010.70

@inproceedings{bcc6893aaa03427096bf2552d24fe5fe,
title = "Document similarity self-join with MapReduce",
keywords = "MapReduce, Similarity Self-Join, Web information retrieval",
author = "Ranieri Baraglia and Gianmarco Morales and Claudio Lucchese",
year = "2010",
doi = "10.1109/ICDM.2010.70",
language = "English",
isbn = "9780769542560",
pages = "731--736",
booktitle = "Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010",

}
