Query by document

Yin Yang, Nilesh Bansal, Wisam Dakka, Panagiotis Ipeirotis, Nick Koudas, Dimitris Papadias

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

66 Citations (Scopus)

Abstract

We are experiencing an unprecedented increase in content contributed by users in forums such as blogs, social networking sites and micro-blogging services. This abundance of content complements content on web sites and traditional media forums such as newspapers, news and financial streams, and so on. Given this plethora of information, there is a pressing need to cross-reference information across textual services. For example, we commonly read a news item and wonder whether any blogs report related content, or vice versa. In this paper, we present techniques to automate the process of cross-referencing online information content. We introduce methodologies to extract phrases from a given "query document" to be used as queries to search interfaces with the goal of retrieving content related to the query document. In particular, we consider two techniques to extract and score key phrases. We also consider techniques to complement extracted phrases with information present in external sources such as Wikipedia and introduce an algorithm called RelevanceRank for this purpose. We discuss both these techniques in detail and provide an experimental study utilizing a large number of human judges from Amazon's Mechanical Turk service. Detailed experiments demonstrate the effectiveness and efficiency of the proposed techniques for the task of automating retrieval of documents related to a query document.
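As a rough illustration of the general idea of extracting and scoring key phrases from a query document, the sketch below collects runs of non-stopword tokens as candidate phrases and ranks them by frequency times phrase length. This is not the paper's scoring method and not RelevanceRank; the stopword list, the scoring rule, and the names `candidate_phrases` and `top_phrases` are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for",
    "with", "by", "or", "as", "at", "that", "this", "it", "we", "be", "from",
}


def candidate_phrases(text, max_len=3):
    """Yield n-grams (up to max_len words) taken from runs of non-stopword tokens.

    Punctuation is discarded by the tokenizer, so a run may cross a
    sentence boundary; only stopwords split runs in this simplified sketch.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    run = []
    for tok in tokens + [None]:  # the None sentinel flushes the final run
        if tok is not None and tok not in STOPWORDS:
            run.append(tok)
            continue
        for n in range(1, max_len + 1):
            for i in range(len(run) - n + 1):
                yield " ".join(run[i:i + n])
        run = []


def top_phrases(text, k=5, max_len=3):
    """Return the k best phrases, scored as frequency * number of words."""
    counts = Counter(candidate_phrases(text, max_len))
    scored = {p: c * len(p.split()) for p, c in counts.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda p: (-scored[p], p))[:k]


if __name__ == "__main__":
    doc = "Mechanical Turk judges rated results. The Mechanical Turk judges agreed."
    for phrase in top_phrases(doc, k=3):
        print(phrase)
```

The returned phrases could then be submitted as queries to a search interface; how the queries are combined and how results are ranked is left open here.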

Original language: English
Title of host publication: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09
Pages: 34-43
Number of pages: 10
DOIs: https://doi.org/10.1145/1498759.1498806
Publication status: Published - 2009
Externally published: Yes
Event: 2nd ACM International Conference on Web Search and Data Mining, WSDM'09 - Barcelona, Spain
Duration: 9 Feb 2009 to 12 Feb 2009

Other

Other: 2nd ACM International Conference on Web Search and Data Mining, WSDM'09
Country: Spain
City: Barcelona
Period: 9/2/09 to 12/2/09

Fingerprint

Blogs
Websites
Experiments

Keywords

  • Blog
  • Similarity matching
  • Web 2.0
  • Wikipedia

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P., Koudas, N., & Papadias, D. (2009). Query by document. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09 (pp. 34-43) https://doi.org/10.1145/1498759.1498806

Query by document. / Yang, Yin; Bansal, Nilesh; Dakka, Wisam; Ipeirotis, Panagiotis; Koudas, Nick; Papadias, Dimitris.

Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09. 2009. p. 34-43.

Yang, Y, Bansal, N, Dakka, W, Ipeirotis, P, Koudas, N & Papadias, D 2009, Query by document. in Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09. pp. 34-43, 2nd ACM International Conference on Web Search and Data Mining, WSDM'09, Barcelona, Spain, 9/2/09. https://doi.org/10.1145/1498759.1498806
Yang Y, Bansal N, Dakka W, Ipeirotis P, Koudas N, Papadias D. Query by document. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09. 2009. p. 34-43 https://doi.org/10.1145/1498759.1498806
Yang, Yin ; Bansal, Nilesh ; Dakka, Wisam ; Ipeirotis, Panagiotis ; Koudas, Nick ; Papadias, Dimitris. / Query by document. Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09. 2009. pp. 34-43
@inproceedings{e8e239dda1234636b9d1a0b1242f2add,
title = "Query by document",
abstract = "We are experiencing an unprecedented increase in content contributed by users in forums such as blogs, social networking sites and micro-blogging services. This abundance of content complements content on web sites and traditional media forums such as newspapers, news and financial streams, and so on. Given this plethora of information, there is a pressing need to cross-reference information across textual services. For example, we commonly read a news item and wonder whether any blogs report related content, or vice versa. In this paper, we present techniques to automate the process of cross-referencing online information content. We introduce methodologies to extract phrases from a given {"}query document{"} to be used as queries to search interfaces with the goal of retrieving content related to the query document. In particular, we consider two techniques to extract and score key phrases. We also consider techniques to complement extracted phrases with information present in external sources such as Wikipedia and introduce an algorithm called RelevanceRank for this purpose. We discuss both these techniques in detail and provide an experimental study utilizing a large number of human judges from Amazon's Mechanical Turk service. Detailed experiments demonstrate the effectiveness and efficiency of the proposed techniques for the task of automating retrieval of documents related to a query document.",
keywords = "Blog, Similarity matching, Web 2.0, Wikipedia",
author = "Yin Yang and Nilesh Bansal and Wisam Dakka and Panagiotis Ipeirotis and Nick Koudas and Dimitris Papadias",
year = "2009",
doi = "10.1145/1498759.1498806",
language = "English",
isbn = "9781605583907",
pages = "34--43",
booktitle = "Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09",

}

TY - GEN

T1 - Query by document

AU - Yang, Yin

AU - Bansal, Nilesh

AU - Dakka, Wisam

AU - Ipeirotis, Panagiotis

AU - Koudas, Nick

AU - Papadias, Dimitris

PY - 2009

Y1 - 2009

AB - We are experiencing an unprecedented increase in content contributed by users in forums such as blogs, social networking sites and micro-blogging services. This abundance of content complements content on web sites and traditional media forums such as newspapers, news and financial streams, and so on. Given this plethora of information, there is a pressing need to cross-reference information across textual services. For example, we commonly read a news item and wonder whether any blogs report related content, or vice versa. In this paper, we present techniques to automate the process of cross-referencing online information content. We introduce methodologies to extract phrases from a given "query document" to be used as queries to search interfaces with the goal of retrieving content related to the query document. In particular, we consider two techniques to extract and score key phrases. We also consider techniques to complement extracted phrases with information present in external sources such as Wikipedia and introduce an algorithm called RelevanceRank for this purpose. We discuss both these techniques in detail and provide an experimental study utilizing a large number of human judges from Amazon's Mechanical Turk service. Detailed experiments demonstrate the effectiveness and efficiency of the proposed techniques for the task of automating retrieval of documents related to a query document.

KW - Blog

KW - Similarity matching

KW - Web 2.0

KW - Wikipedia

UR - http://www.scopus.com/inward/record.url?scp=70349111073&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70349111073&partnerID=8YFLogxK

U2 - 10.1145/1498759.1498806

DO - 10.1145/1498759.1498806

M3 - Conference contribution

SN - 9781605583907

SP - 34

EP - 43

BT - Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09

ER -