Cluster generation and labeling for web snippets

A fast, accurate hierarchical solution

Filippo Geraci, Marco Pellegrini, Marco Maggini, Fabrizio Sebastiani

Research output: Contribution to journal (Article)

20 Citations (Scopus)

Abstract

This paper describes Armil, a meta-search engine that groups the web snippets returned by auxiliary search engines into disjoint labeled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to his/her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labeling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and they use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labeling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labeling algorithms. On a standard desktop PC (AMD Athlon with a 1-GHz clock and 750 Mbytes of RAM), Armil clusters and labels up to 200 snippets in less than one second.
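As a rough illustration of the furthest-point-first idea mentioned in the abstract, the following is a minimal sketch of the basic greedy heuristic for metric k-center clustering. It is not Armil's fast variant, whose details the abstract does not give. The function name `furthest_point_first`, the choice of the first point as the initial center, and the use of Euclidean distance on numeric tuples (as a stand-in for whatever metric is defined on snippets) are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def furthest_point_first(points, k):
    """Greedy k-center sketch: returns (center indices, cluster id per point).

    Assumption: points are numeric tuples compared with Euclidean distance;
    a snippet-clustering system would plug in its own metric here.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    centers = [0]  # arbitrary first center (assumption: index 0)
    # nearest[i] is the distance from point i to its closest center so far
    nearest = [dist(p, points[0]) for p in points]
    assignment = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        new = max(range(len(points)), key=nearest.__getitem__)
        centers.append(new)
        cluster_id = len(centers) - 1
        # Reassign points that are closer to the new center.
        for i, p in enumerate(points):
            d = dist(p, points[new])
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = cluster_id

    return centers, assignment


points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9)]
centers, assignment = furthest_point_first(points, 3)
```

Each round costs one pass over the points, so the sketch runs in O(nk) distance computations; the result is a set of disjoint clusters, one per chosen center.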

Original language: English
Pages (from-to): 413-443
Number of pages: 31
Journal: Internet Mathematics
Volume: 3
Issue number: 4
DOIs: 10.1080/15427951.2006.10129133
Publication status: Published - 1 Jan 2006
Externally published: Yes

Fingerprint

Labeling
Search engines
Clustering
Search Engine
Random access storage
World Wide Web
Labels
Clocks
Labeling Algorithm
User Evaluation
Metric
Cluster Algorithm
Information Gain
Processing
Disjoint
Benchmark
Term

ASJC Scopus subject areas

  • Modelling and Simulation
  • Computational Mathematics
  • Applied Mathematics

Cite this

Cluster generation and labeling for web snippets: A fast, accurate hierarchical solution. / Geraci, Filippo; Pellegrini, Marco; Maggini, Marco; Sebastiani, Fabrizio.

In: Internet Mathematics, Vol. 3, No. 4, 01.01.2006, p. 413-443.

@article{fec2ba06062748d38f506ad0f16869b3,
title = "Cluster generation and labeling for web snippets: A fast, accurate hierarchical solution",
abstract = "This paper describes Armil, a meta-search engine that groups the web snippets returned by auxiliary search engines into disjoint labeled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to his/her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labeling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and they use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labeling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10{\%}. We also report the results of a thorough user evaluation of both the clustering and the cluster labeling algorithms. On a standard desktop PC (AMD Athlon with a 1-GHz clock and 750 Mbytes of RAM), Armil clusters and labels up to 200 snippets in less than one second.",
author = "Filippo Geraci and Marco Pellegrini and Marco Maggini and Fabrizio Sebastiani",
year = "2006",
month = "1",
day = "1",
doi = "10.1080/15427951.2006.10129133",
language = "English",
volume = "3",
pages = "413--443",
journal = "Internet Mathematics",
issn = "1542-7951",
publisher = "Taylor and Francis Ltd.",
number = "4",

}

TY - JOUR

T1 - Cluster generation and labeling for web snippets

T2 - A fast, accurate hierarchical solution

AU - Geraci, Filippo

AU - Pellegrini, Marco

AU - Maggini, Marco

AU - Sebastiani, Fabrizio

PY - 2006/1/1

Y1 - 2006/1/1

AB - This paper describes Armil, a meta-search engine that groups the web snippets returned by auxiliary search engines into disjoint labeled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to his/her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labeling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and they use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labeling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labeling algorithms. On a standard desktop PC (AMD Athlon with a 1-GHz clock and 750 Mbytes of RAM), Armil clusters and labels up to 200 snippets in less than one second.

UR - http://www.scopus.com/inward/record.url?scp=84924114902&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84924114902&partnerID=8YFLogxK

U2 - 10.1080/15427951.2006.10129133

DO - 10.1080/15427951.2006.10129133

M3 - Article

VL - 3

SP - 413

EP - 443

JO - Internet Mathematics

JF - Internet Mathematics

SN - 1542-7951

IS - 4

ER -