Scaling out All Pairs Similarity Search with MapReduce

Gianmarco Morales, Claudio Lucchese, Ranieri Baraglia

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Given a collection of objects, the All Pairs Similarity Search problem involves discovering all those pairs of objects whose similarity is above a certain threshold. In this paper we focus on document collections which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. The proposed algorithm is based on the inverted index approach and incorporates state-of-the-art pruning techniques. This is the first work that explores the feasibility of index pruning in a MapReduce algorithm. We evaluate several heuristics aimed at reducing the communication costs and the load imbalance. The resulting algorithm gives exact results up to 5× faster than the current best known solution that employs MapReduce.
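As a rough illustration of the inverted index approach mentioned in the abstract, the sketch below computes all pairs whose cosine similarity reaches a threshold on a single machine. It is not the authors' algorithm: it omits the pruning techniques and the communication and load-balancing heuristics, and it only imitates the MapReduce phases with in-memory dictionaries. The function names (`normalize`, `build_index`, `all_pairs`), the choice of cosine similarity, and the `>=` comparison are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def normalize(doc):
    """Scale a sparse term->weight vector to unit length, so dot product = cosine.

    Assumes the vector is non-empty and has at least one non-zero weight.
    """
    norm = sqrt(sum(w * w for w in doc.values()))
    return {term: w / norm for term, w in doc.items()}


def build_index(docs):
    """Imitates the indexing job: group (doc_id, weight) postings by term."""
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        for term, weight in doc.items():
            index[term].append((doc_id, weight))
    return index


def all_pairs(docs, threshold):
    """Return {(id_a, id_b): cosine} for every pair with cosine >= threshold."""
    docs = {doc_id: normalize(doc) for doc_id, doc in docs.items()}
    index = build_index(docs)

    # Imitates the similarity job: each posting list contributes one partial
    # product per pair of documents sharing that term; summing the partial
    # products over all shared terms yields the full dot product.
    scores = defaultdict(float)
    for postings in index.values():
        for (a, wa), (b, wb) in combinations(sorted(postings), 2):
            scores[(a, b)] += wa * wb

    # No pruning here: every candidate pair is scored, then filtered.
    return {pair: s for pair, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    # Hypothetical toy collection of term-weight vectors.
    toy = {
        "d1": {"map": 1.0, "reduce": 1.0},
        "d2": {"map": 1.0, "reduce": 1.0, "index": 1.0},
        "d3": {"cat": 1.0},
    }
    print(all_pairs(toy, threshold=0.8))
```

Documents that share no term never meet in a posting list, so they are never compared; this is the sparseness that makes the inverted index approach attractive. The pruning described in the abstract would go further by reducing the index itself.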

Original language: English
Pages (from-to): 25-30
Number of pages: 6
Journal: CEUR Workshop Proceedings
Volume: 630
Publication status: Published - 2010
Externally published: Yes

Fingerprint

Parallel algorithms
Communication
Costs

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Scaling out All Pairs Similarity Search with MapReduce. / Morales, Gianmarco; Lucchese, Claudio; Baraglia, Ranieri.

In: CEUR Workshop Proceedings, Vol. 630, 2010, p. 25-30.

@article{69968400413d431d8b1561fda631bf18,
title = "Scaling out All Pairs Similarity Search with MapReduce",
author = "Gianmarco Morales and Claudio Lucchese and Ranieri Baraglia",
year = "2010",
language = "English",
volume = "630",
pages = "25--30",
journal = "CEUR Workshop Proceedings",
issn = "1613-0073",
publisher = "CEUR-WS",

}

TY  - JOUR
T1  - Scaling out All Pairs Similarity Search with MapReduce
AU  - Morales, Gianmarco
AU  - Lucchese, Claudio
AU  - Baraglia, Ranieri
PY  - 2010
Y1  - 2010
UR  - http://www.scopus.com/inward/record.url?scp=84888869336&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84888869336&partnerID=8YFLogxK
M3  - Article
VL  - 630
SP  - 25
EP  - 30
JO  - CEUR Workshop Proceedings
JF  - CEUR Workshop Proceedings
SN  - 1613-0073
ER  -