Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic

Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

30 Citations (Scopus)

Abstract

Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primal importance, and businesses are willing to trade more resources for better accuracy. Throughout our experience with building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.

Original language: English
Title of host publication: Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings
Pages: 618-629
Number of pages: 12
DOI: https://doi.org/10.1145/1353343.1353418
Publication status: Published - 16 May 2008
Externally published: Yes
Event: 11th International Conference on Extending Database Technology, EDBT 2008 - Nantes, France
Duration: 25 Mar 2008 to 29 Mar 2008

Other

Conference: 11th International Conference on Extending Database Technology, EDBT 2008
Country: France
City: Nantes
Period: 25/3/08 to 29/3/08

Fingerprint

Search engines
Industry
Sampling
Experiments

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems
  • Software

Cite this

Metwally, A., Agrawal, D., & El Abbadi, A. (2008). Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings (pp. 618-629) https://doi.org/10.1145/1353343.1353418

Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. / Metwally, Ahmed; Agrawal, Divyakant; El Abbadi, Amr.

Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings. 2008. p. 618-629.

Metwally, A, Agrawal, D & El Abbadi, A 2008, Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. in Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings. pp. 618-629, 11th International Conference on Extending Database Technology, EDBT 2008, Nantes, France, 25/3/08. https://doi.org/10.1145/1353343.1353418
Metwally A, Agrawal D, El Abbadi A. Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings. 2008. p. 618-629 https://doi.org/10.1145/1353343.1353418
Metwally, Ahmed ; Agrawal, Divyakant ; El Abbadi, Amr. / Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings. 2008. pp. 618-629
@inproceedings{caf670dc1a474a898f0a8940d58f15e0,
title = "Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic",
abstract = "Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primal importance, and businesses are willing to trade more resources for better accuracy. Throughout our experience with building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.",
author = "Ahmed Metwally and Divyakant Agrawal and {El Abbadi}, Amr",
year = "2008",
month = "5",
day = "16",
doi = "10.1145/1353343.1353418",
language = "English",
isbn = "9781595939265",
pages = "618--629",
booktitle = "Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings",
}

TY - GEN
T1 - Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic
AU - Metwally, Ahmed
AU - Agrawal, Divyakant
AU - El Abbadi, Amr
PY - 2008/5/16
Y1 - 2008/5/16

AB - Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primal importance, and businesses are willing to trade more resources for better accuracy. Throughout our experience with building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.

UR - http://www.scopus.com/inward/record.url?scp=43349089289&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=43349089289&partnerID=8YFLogxK
U2 - 10.1145/1353343.1353418
DO - 10.1145/1353343.1353418
M3 - Conference contribution
SN - 9781595939265
SP - 618
EP - 629
BT - Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings
ER -