Streaming similarity self-join

Gianmarco Morales, Aristides Gionis

Research output: Contribution to journal, Article

10 Citations (Scopus)

Abstract

We introduce and study the problem of computing the similarity self-join in a streaming context (sssj), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the sssj problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.
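
To make the idea of time-dependent similarity concrete, the following is a minimal brute-force sketch and not the MB or STR framework from the paper. It assumes, purely for illustration, that the similarity is cosine similarity on L2-normalized sparse vectors multiplied by an exponential decay exp(-lam * dt); the abstract only states that similarity decreases with the difference in arrival time. The names `streaming_self_join`, `theta`, and `lam` are hypothetical. Under this assumption, an item older than ln(1/theta)/lam can no longer exceed the threshold with any new item, so it can be dropped and memory stays bounded.

```python
import math
from collections import deque


def cosine(x, y):
    """Dot product of two sparse vectors (dicts), assumed L2-normalized."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector
    return sum(w * y.get(dim, 0.0) for dim, w in x.items())


def streaming_self_join(stream, theta, lam):
    """Yield (older_id, newer_id, sim) for pairs with decayed similarity > theta.

    stream: iterable of (item_id, timestamp, vector), timestamps non-decreasing.
    theta:  similarity threshold in (0, 1).
    lam:    decay rate (assumed exponential decay; an illustrative choice).
    """
    # Since cosine <= 1, a pair further apart than this horizon cannot
    # have decayed similarity above theta.
    horizon = math.log(1.0 / theta) / lam
    window = deque()  # items still able to match, oldest first

    for item_id, t, vec in stream:
        # Time filtering: evict items too old to match anything new.
        while window and t - window[0][1] > horizon:
            window.popleft()
        # Brute-force comparison against the remaining items.
        for other_id, other_t, other_vec in window:
            sim = cosine(vec, other_vec) * math.exp(-lam * (t - other_t))
            if sim > theta:
                yield other_id, item_id, sim
        window.append((item_id, t, vec))


if __name__ == "__main__":
    demo = [
        ("a", 0.0, {"x": 1.0}),
        ("b", 1.0, {"x": 1.0}),
        ("c", 100.0, {"x": 1.0}),
    ]
    for pair in streaming_self_join(demo, theta=0.5, lam=0.1):
        print(pair)
```

The sketch compares each new item against every item in the window. The frameworks described in the abstract replace that scan with index-based filtering.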

Original language: English
Pages (from-to): 792-803
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 9
Issue number: 10
Publication status: Published - 2016

Fingerprint

Pipelines
Data storage equipment
Experiments

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Streaming similarity self-join. / Morales, Gianmarco; Gionis, Aristides.

In: Proceedings of the VLDB Endowment, Vol. 9, No. 10, 2016, p. 792-803.

Morales, G & Gionis, A 2016, 'Streaming similarity self-join', Proceedings of the VLDB Endowment, vol. 9, no. 10, pp. 792-803.
Morales, Gianmarco ; Gionis, Aristides. / Streaming similarity self-join. In: Proceedings of the VLDB Endowment. 2016 ; Vol. 9, No. 10. pp. 792-803.
@article{8428c7cf6293411ba9bc29b3325e73f1,
title = "Streaming similarity self-join",
abstract = "We introduce and study the problem of computing the similarity self-join in a streaming context (sssj), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the sssj problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.",
author = "Gianmarco Morales and Aristides Gionis",
year = "2016",
language = "English",
volume = "9",
pages = "792--803",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "10",

}

TY - JOUR

T1 - Streaming similarity self-join

AU - Morales, Gianmarco

AU - Gionis, Aristides

PY - 2016

Y1 - 2016

AB - We introduce and study the problem of computing the similarity self-join in a streaming context (sssj), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the sssj problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.

UR - http://www.scopus.com/inward/record.url?scp=84979500503&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84979500503&partnerID=8YFLogxK

M3 - Article

VL - 9

SP - 792

EP - 803

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 10

ER -