Breaking the top-k barrier of hidden web databases?

Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

A large number of web databases are only accessible through proprietary form-like interfaces which require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint - i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to be a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of "digging deeper" into such web databases. Our main contribution is the meta-algorithm GetNext that can retrieve the next ranked tuple from the hidden web database using only the restrictive interface of a web database without any prior knowledge of its ranking function. This algorithm can then be called iteratively to retrieve as many top ranked tuples as necessary. We develop principled and efficient algorithms that are based on generating and executing multiple reformulated queries and inferring the next ranked tuple from their returned results. We provide theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.

Original language: English
Title of host publication: ICDE 2013 - 29th International Conference on Data Engineering
Pages: 1045-1056
Number of pages: 12
DOI: https://doi.org/10.1109/ICDE.2013.6544896
Publication status: Published - 15 Aug 2013
Externally published: Yes
Event: 29th International Conference on Data Engineering, ICDE 2013 - Brisbane, QLD, Australia
Duration: 8 Apr 2013 to 11 Apr 2013

Other

Other: 29th International Conference on Data Engineering, ICDE 2013
Country: Australia
City: Brisbane, QLD
Period: 8/4/13 to 11/4/13

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Thirumuruganathan, S., Zhang, N., & Das, G. (2013). Breaking the top-k barrier of hidden web databases? In ICDE 2013 - 29th International Conference on Data Engineering (pp. 1045-1056). [6544896] https://doi.org/10.1109/ICDE.2013.6544896

Breaking the top-k barrier of hidden web databases? / Thirumuruganathan, Saravanan; Zhang, Nan; Das, Gautam.

ICDE 2013 - 29th International Conference on Data Engineering. 2013. p. 1045-1056 6544896.

Thirumuruganathan, S, Zhang, N & Das, G 2013, Breaking the top-k barrier of hidden web databases? in ICDE 2013 - 29th International Conference on Data Engineering, 6544896, pp. 1045-1056, 29th International Conference on Data Engineering, ICDE 2013, Brisbane, QLD, Australia, 8/4/13. https://doi.org/10.1109/ICDE.2013.6544896
Thirumuruganathan S, Zhang N, Das G. Breaking the top-k barrier of hidden web databases? In ICDE 2013 - 29th International Conference on Data Engineering. 2013. p. 1045-1056. 6544896 https://doi.org/10.1109/ICDE.2013.6544896
Thirumuruganathan, Saravanan; Zhang, Nan; Das, Gautam. / Breaking the top-k barrier of hidden web databases? ICDE 2013 - 29th International Conference on Data Engineering. 2013. pp. 1045-1056
@inproceedings{b1ec33520dd643ea8a65b0c3ec447dc9,
title = "Breaking the top-k barrier of hidden web databases?",
abstract = "A large number of web databases are only accessible through proprietary form-like interfaces which require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint - i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to be a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of {"}digging deeper{"} into such web databases. Our main contribution is the meta-algorithm GetNext that can retrieve the next ranked tuple from the hidden web database using only the restrictive interface of a web database without any prior knowledge of its ranking function. This algorithm can then be called iteratively to retrieve as many top ranked tuples as necessary. We develop principled and efficient algorithms that are based on generating and executing multiple reformulated queries and inferring the next ranked tuple from their returned results. We provide theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.",
author = "Saravanan Thirumuruganathan and Nan Zhang and Gautam Das",
year = "2013",
month = "8",
day = "15",
doi = "10.1109/ICDE.2013.6544896",
language = "English",
isbn = "9781467349086",
pages = "1045--1056",
booktitle = "ICDE 2013 - 29th International Conference on Data Engineering",
}

TY - GEN
T1 - Breaking the top-k barrier of hidden web databases?
AU - Thirumuruganathan, Saravanan
AU - Zhang, Nan
AU - Das, Gautam
PY - 2013/8/15
Y1 - 2013/8/15
AB - A large number of web databases are only accessible through proprietary form-like interfaces which require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint - i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to be a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of "digging deeper" into such web databases. Our main contribution is the meta-algorithm GetNext that can retrieve the next ranked tuple from the hidden web database using only the restrictive interface of a web database without any prior knowledge of its ranking function. This algorithm can then be called iteratively to retrieve as many top ranked tuples as necessary. We develop principled and efficient algorithms that are based on generating and executing multiple reformulated queries and inferring the next ranked tuple from their returned results. We provide theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.
UR - http://www.scopus.com/inward/record.url?scp=84881328063&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84881328063&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2013.6544896
DO - 10.1109/ICDE.2013.6544896
M3 - Conference contribution
AN - SCOPUS:84881328063
SN - 9781467349086
SP - 1045
EP - 1056
BT - ICDE 2013 - 29th International Conference on Data Engineering
ER -