Cluster generation and cluster labelling for Web snippets

Filippo Geraci, Marco Pellegrini, Marco Maggini, Fabrizio Sebastiani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

40 Citations (Scopus)

Abstract

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted "external" metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
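
The furthest-point-first heuristic named in the abstract can be sketched as follows. This is a minimal illustration of the plain greedy algorithm for metric k-center, not the authors' fast variant and not Armil's code. The function name, the choice of index 0 as the first centre, the Euclidean distance, and the toy 2-D points are all assumptions made for the example; the abstract does not specify them.

```python
import math


def furthest_point_first(points, k, dist):
    """Plain greedy furthest-point-first heuristic for metric k-center.

    Returns (centers, assign): the indices of the chosen centres and, for
    each point, the position in `centers` of its closest centre.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    centers = [0]  # assumption: start from the first point (any point works)
    # nearest[i] holds the distance from point i to its closest centre so far
    nearest = [dist(points[0], p) for p in points]
    assign = [0] * len(points)

    while len(centers) < k:
        # The next centre is the point furthest from every current centre.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        # Reassign only the points that are strictly closer to the new centre.
        for i, p in enumerate(points):
            d = dist(points[far], p)
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1
    return centers, assign


# Toy data (hypothetical): two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers, assign = furthest_point_first(points, 3, math.dist)
print(centers)  # indices of the chosen centres
print(assign)   # cluster id of each point
```

Each round costs one pass over the points, so the sketch performs on the order of n times k distance evaluations. A real snippet clusterer would replace the toy points with snippet vectors and a suitable metric.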

Original language: English
Title of host publication: String Processing and Information Retrieval - 13th International Conference, SPIRE 2006, Proceedings
Publisher: Springer Verlag
Pages: 25-36
Number of pages: 12
ISBN (Print): 3540457747, 9783540457749
Publication status: Published - 1 Jan 2006
Event: 13th International Conference on String Processing and Information Retrieval, SPIRE 2006 - Glasgow, United Kingdom
Duration: 11 Oct 2006 - 13 Oct 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4209 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th International Conference on String Processing and Information Retrieval, SPIRE 2006
Country: United Kingdom
City: Glasgow
Period: 11/10/06 - 13/10/06

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Geraci, F., Pellegrini, M., Maggini, M., & Sebastiani, F. (2006). Cluster generation and cluster labelling for Web snippets. In String Processing and Information Retrieval - 13th International Conference, SPIRE 2006, Proceedings (pp. 25-36). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4209 LNCS). Springer Verlag.