Scaling out All Pairs Similarity Search with MapReduce

Gianmarco Morales, Claudio Lucchese, Ranieri Baraglia

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Given a collection of objects, the All Pairs Similarity Search problem asks for all pairs of objects whose similarity is above a given threshold. In this paper we focus on document collections, whose sparseness allows effective pruning strategies. Our contribution is a new parallel algorithm for the MapReduce framework. The proposed algorithm is based on the inverted index approach and incorporates state-of-the-art pruning techniques. This is the first work to explore the feasibility of index pruning in a MapReduce algorithm. We evaluate several heuristics aimed at reducing communication costs and load imbalance. The resulting algorithm produces exact results and runs up to 5× faster than the best known MapReduce-based solution.
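
The inverted index approach mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's algorithm: it applies no pruning, simulates the map and reduce phases with ordinary Python functions, and assumes cosine similarity over non-empty documents represented as term-weight dictionaries. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math


def normalize(doc):
    """Scale a term-weight dict to unit length (assumes a non-empty document)."""
    norm = math.sqrt(sum(w * w for w in doc.values()))
    return {term: w / norm for term, w in doc.items()}


def map_index(doc_id, doc):
    """Map phase: emit one (term, posting) record per term in the document."""
    for term, weight in doc.items():
        yield term, (doc_id, weight)


def reduce_index(postings):
    """Reduce phase: for one term, emit a partial product for every document pair."""
    for (a, wa), (b, wb) in combinations(sorted(postings), 2):
        yield (a, b), wa * wb


def all_pairs(docs, threshold):
    """Return {(id_a, id_b): cosine} for all pairs with cosine above threshold."""
    # Build the inverted index: term -> list of (doc_id, weight) postings.
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        for term, posting in map_index(doc_id, normalize(doc)):
            index[term].append(posting)

    # Sum the per-term partial products into full dot products.
    scores = defaultdict(float)
    for postings in index.values():
        for pair, partial in reduce_index(postings):
            scores[pair] += partial

    return {pair: s for pair, s in scores.items() if s > threshold}


# Hypothetical sample collection: d1 and d2 share two terms, d3 shares none.
docs = {
    "d1": {"map": 1.0, "reduce": 1.0},
    "d2": {"map": 1.0, "reduce": 1.0, "index": 1.0},
    "d3": {"prune": 1.0},
}
result = all_pairs(docs, 0.5)
```

Because only documents that share a term ever meet in a posting list, pairs with no common terms (such as d1 and d3 here) are never scored. That is the benefit of sparseness the abstract refers to; the pruning techniques it mentions would reduce the index further and are not shown.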

Original language: English
Pages (from-to): 25-30
Number of pages: 6
Journal: CEUR Workshop Proceedings
Volume: 630
Publication status: Published - 2010
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (all)