Scalable collection summarization and selection

R. Dolin, D. Agrawal, A. El Abbadi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Citations (Scopus)

Abstract

Information retrieval over the Internet increasingly requires the filtering of thousands of information sources. As the number and variety of sources increase, new ways of automatically summarizing, discovering, and selecting sources relevant to a user's query are needed. Pharos is a highly scalable distributed architecture for locating heterogeneous information sources. Its design is hierarchical, thus allowing it to scale well as the number of information sources increases. We demonstrate the feasibility of the Pharos architecture using 2500 Usenet newsgroups as separate collections. Each newsgroup is summarized via automated Library of Congress classification. We show that using Pharos as an intermediate retrieval mechanism provides acceptable accuracy of source selection compared to selecting sources using complete classification information, while maintaining good scalability. This implies that hierarchical distributed metadata and automated classification are potentially useful paradigms to address scalability problems in large-scale distributed information retrieval applications.
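The idea of selecting sources from coarse hierarchical summaries rather than complete classification information can be illustrated with a minimal sketch. It is not the Pharos implementation: the function names `summarize` and `select_sources`, the fraction-based score, and the sample classification paths are all assumptions made for illustration. Each collection is reduced to counts of category prefixes at a chosen depth of the hierarchy, and sources are ranked by how much of their content falls under the query's category.

```python
from collections import Counter

# Hypothetical sample data: each collection lists one classification path per
# document, written as a tuple from the broadest class to the narrowest.
# The paths are illustrative only, not real classification assignments.
COLLECTIONS = {
    "comp.databases": [("Q", "QA", "QA76"), ("Q", "QA", "QA76"), ("Z", "ZA", "ZA4")],
    "rec.gardening": [("S", "SB", "SB4"), ("S", "SB", "SB4"), ("Q", "QK", "QK9")],
    "sci.math": [("Q", "QA", "QA1"), ("Q", "QA", "QA2")],
}


def summarize(doc_paths, depth):
    """Summarize a collection as counts of category prefixes cut off at `depth`."""
    return Counter(path[:depth] for path in doc_paths)


def select_sources(summaries, query_path, depth, k=2):
    """Rank collections by the share of documents under the query's category.

    Assumed scoring rule: matching documents divided by all documents in the
    summary. Collections with a zero score are never returned.
    """
    target = query_path[:depth]
    scores = {}
    for name, summary in summaries.items():
        total = sum(summary.values())
        scores[name] = summary[target] / total if total else 0.0
    # Sort by descending score, breaking ties by name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked[:k] if scores[name] > 0]


query = ("Q", "QA", "QA76")

# Coarse summaries keep only the top level of the hierarchy (small metadata).
coarse = {name: summarize(paths, 1) for name, paths in COLLECTIONS.items()}
# Complete summaries keep the full classification path.
complete = {name: summarize(paths, 3) for name, paths in COLLECTIONS.items()}

coarse_choice = select_sources(coarse, query, depth=1)
complete_choice = select_sources(complete, query, depth=3)
print(coarse_choice)
print(complete_choice)
```

In this toy data the coarse summaries still surface the collection that the complete summaries pick, but they also admit a less relevant one, which is the kind of accuracy-for-scalability trade-off the abstract describes.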

Original language: English
Title of host publication: Proceedings of the ACM International Conference on Digital Libraries
Place of Publication: New York, NY, United States
Publisher: ACM
Pages: 49-58
Number of pages: 10
ISBN (Print): 1581131453
Publication status: Published - 1 Dec 1999
Externally published: Yes
Event: Proceedings of the 1999 4th ACM International Conference on Digital Libraries (DL'99) - Berkeley, CA, USA
Duration: 11 Aug 1999 - 14 Aug 1999

Fingerprint

internet community
Information retrieval
source of information
Scalability
Metadata
Internet
paradigm

ASJC Scopus subject areas

  • Computer Science(all)
  • Social Sciences(all)

Cite this

Dolin, R., Agrawal, D., & El Abbadi, A. (1999). Scalable collection summarization and selection. In Proceedings of the ACM International Conference on Digital Libraries (pp. 49-58). New York, NY, United States: ACM.

@inproceedings{c3bfebd6cc7540d6825e66e80c2acff4,
title = "Scalable collection summarization and selection",
author = "R. Dolin and D. Agrawal and {El Abbadi}, A.",
year = "1999",
month = "12",
day = "1",
language = "English",
isbn = "1581131453",
pages = "49--58",
booktitle = "Proceedings of the ACM International Conference on Digital Libraries",
publisher = "ACM",

}

TY - GEN

T1 - Scalable collection summarization and selection

AU - Dolin, R.

AU - Agrawal, D.

AU - El Abbadi, A.

PY - 1999/12/1

Y1 - 1999/12/1

UR - http://www.scopus.com/inward/record.url?scp=0033279050&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0033279050&partnerID=8YFLogxK

M3 - Conference contribution

SN - 1581131453

SP - 49

EP - 58

BT - Proceedings of the ACM International Conference on Digital Libraries

PB - ACM

CY - New York, NY, United States

ER -