Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic

Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

32 Citations (Scopus)

Abstract

Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research over the past two decades. Several sampling and sketching algorithms have been proposed to solve this problem accurately. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primary importance, and businesses are willing to trade more resources for better accuracy. Through our experience building a distinct count system at a major search engine, Ask.com, we reviewed the literature on approximating distinct counts and compared most of the algorithms in it. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results strongly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.
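
For readers unfamiliar with the estimator named in the abstract, the following is a minimal sketch of Linear Counting as it is commonly described: hash each element to one slot of an m-slot bitmap, then estimate the distinct count as -m ln(V), where V is the fraction of slots still empty. It is not taken from the paper. The function name `linear_count`, the default bitmap size, and the choice of BLAKE2b as the hash are assumptions made for illustration only.

```python
import hashlib
import math


def linear_count(items, m=1 << 16):
    """Estimate the number of distinct items with an m-slot bitmap.

    Illustrative only: the bitmap size and hash function are assumptions,
    not parameters taken from the paper.
    """
    bitmap = bytearray(m)  # one byte per slot, for simplicity
    for item in items:
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        bitmap[int.from_bytes(digest, "big") % m] = 1  # duplicates hit the same slot

    zeros = m - sum(bitmap)
    if zeros == 0:
        # Every slot is set, so the estimator is undefined; m was too small.
        raise ValueError("bitmap saturated; increase m")
    return -m * math.log(zeros / m)


# Usage: 50,000 events drawn from 10,000 distinct query strings.
stream = [f"query-{i % 10000}" for i in range(50000)]
estimate = linear_count(stream)
```

The space cost grows linearly with the number of distinct elements to be counted, which is the trade-off the title alludes to: more memory than logarithmic-space sketches in exchange for a tighter estimate.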

Original language: English
Title of host publication: Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings
Pages: 618-629
Number of pages: 12
DOI: https://doi.org/10.1145/1353343.1353418
Publication status: Published - 16 May 2008
Externally published: Yes
Event: 11th International Conference on Extending Database Technology, EDBT 2008 - Nantes, France
Duration: 25 Mar 2008 → 29 Mar 2008

Other

Other: 11th International Conference on Extending Database Technology, EDBT 2008
Country: France
City: Nantes
Period: 25/3/08 → 29/3/08

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems
  • Software

Cite this

Metwally, A., Agrawal, D., & El Abbadi, A. (2008). Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings (pp. 618-629). https://doi.org/10.1145/1353343.1353418