Cluster generation and cluster labelling for Web snippets

Filippo Geraci, Marco Pellegrini, Marco Maggini, Fabrizio Sebastiani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

40 Citations (Scopus)

Abstract

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted "external" metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
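The two techniques named in the abstract can be illustrated with a minimal sketch. It is not Armil's implementation: the abstract refers to a fast version of furthest-point-first and a variant of information gain, whereas the code below shows only the basic furthest-point-first heuristic for k-center and the plain information gain of a term with respect to one cluster. Representing snippets as sets of terms, the Jaccard distance, and all function names are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Illustrative metric on snippets viewed as term sets (an assumption)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def fpf_k_center(
    points: Sequence, k: int, dist: Callable[[object, object], float]
) -> Tuple[List[int], List[int]]:
    """Basic furthest-point-first heuristic for k-center clustering.

    Returns (indices of the chosen centers, cluster index of each point).
    """
    if not points or k < 1:
        raise ValueError("need at least one point and k >= 1")
    k = min(k, len(points))
    centers = [0]  # the first center is arbitrary; index 0 is used here
    nearest = [dist(p, points[0]) for p in points]  # distance to closest center
    assign = [0] * len(points)
    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1
    return centers, assign


def information_gain(
    docs: Sequence[Set[str]], in_cluster: Sequence[bool], term: str
) -> float:
    """Plain information gain (in bits) between term presence and cluster membership."""
    n = len(docs)
    counts = {(t, c): 0 for t in (True, False) for c in (True, False)}
    for doc, c in zip(docs, in_cluster):
        counts[(term in doc, bool(c))] += 1
    ig = 0.0
    for (t, c), n_tc in counts.items():
        if n_tc == 0:
            continue  # an empty cell contributes nothing to the sum
        n_t = counts[(t, True)] + counts[(t, False)]
        n_c = counts[(True, c)] + counts[(False, c)]
        ig += (n_tc / n) * math.log2(n_tc * n / (n_t * n_c))
    return ig


# Toy snippets (made up for illustration only).
snippets = [
    {"jaguar", "car"},
    {"jaguar", "car"},
    {"jaguar", "cat"},
    {"jaguar", "cat"},
]
centers, assign = fpf_k_center(snippets, 2, jaccard_distance)
first_cluster = [a == 0 for a in assign]
ig_car = information_gain(snippets, first_cluster, "car")
ig_jaguar = information_gain(snippets, first_cluster, "jaguar")
```

In this toy run, a term that occurs in every snippet ("jaguar") has zero information gain, while a term confined to one cluster ("car") scores highest, which is why such a measure is a plausible basis for picking cluster labels.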

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 25-36
Number of pages: 12
Volume: 4209 LNCS
Publication status: Published - 2006
Externally published: Yes
Event: 13th International Conference on String Processing and Information Retrieval, SPIRE 2006 - Glasgow, United Kingdom
Duration: 11 Oct 2006 → 13 Oct 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4209 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th International Conference on String Processing and Information Retrieval, SPIRE 2006
Country: United Kingdom
City: Glasgow
Period: 11/10/06 → 13/10/06

Fingerprint

Search engines
World Wide Web
Labeling
Cluster Analysis
Clustering
Search Engine
Labels
Labeling Algorithm
Farthest Point
User Evaluation
Directories
Benchmarking
Metric
Cluster Algorithm
Information Gain
Processing
Diptera
Running
Disjoint
Benchmark

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Geraci, F., Pellegrini, M., Maggini, M., & Sebastiani, F. (2006). Cluster generation and cluster labelling for Web snippets. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4209 LNCS, pp. 25-36). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4209 LNCS).

Cluster generation and cluster labelling for Web snippets. / Geraci, Filippo; Pellegrini, Marco; Maggini, Marco; Sebastiani, Fabrizio.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 4209 LNCS 2006. p. 25-36 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4209 LNCS).


Geraci, F, Pellegrini, M, Maggini, M & Sebastiani, F 2006, Cluster generation and cluster labelling for Web snippets. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 4209 LNCS, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 4209 LNCS, pp. 25-36, 13th International Conference on String Processing and Information Retrieval, SPIRE 2006, Glasgow, United Kingdom, 11/10/06.
Geraci F, Pellegrini M, Maggini M, Sebastiani F. Cluster generation and cluster labelling for Web snippets. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 4209 LNCS. 2006. p. 25-36. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
Geraci, Filippo ; Pellegrini, Marco ; Maggini, Marco ; Sebastiani, Fabrizio. / Cluster generation and cluster labelling for Web snippets. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 4209 LNCS 2006. pp. 25-36 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{64953b7a91fd481da05daa0d66e96f6b,
title = "Cluster generation and cluster labelling for Web snippets",
abstract = "This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted {"}external{"} metrics of clustering quality, Armil achieves better performance levels by 10{\%}. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.",
author = "Filippo Geraci and Marco Pellegrini and Marco Maggini and Fabrizio Sebastiani",
year = "2006",
language = "English",
isbn = "3540457747",
volume = "4209 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "25--36",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Cluster generation and cluster labelling for Web snippets

AU - Geraci, Filippo

AU - Pellegrini, Marco

AU - Maggini, Marco

AU - Sebastiani, Fabrizio

PY - 2006

Y1 - 2006

N2 - This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted "external" metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.


UR - http://www.scopus.com/inward/record.url?scp=33750359861&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33750359861&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:33750359861

SN - 3540457747

SN - 9783540457749

VL - 4209 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 25

EP - 36

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -