Discretizing continuous attributes in AdaBoost for text categorization

Pio Nardiello, Fabrizio Sebastiani, Alessandro Sperduti

Research output: Contribution to journal › Article

18 Citations (Scopus)

Abstract

We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, ADABOOST.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
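The abstract describes replacing binary term indicators with discretized versions of weighted (e.g. tf-idf) representations, using entropy-based discretization. The paper's exact procedure is not reproduced on this page; the sketch below illustrates one generic entropy-minimizing binary split of a continuous attribute, in the spirit of Fayyad-Irani discretization, with made-up example weights and labels.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def best_split(values, labels):
    """Return the cut point on `values` that minimizes the weighted
    class entropy of the two induced partitions (a single
    entropy-based discretization step)."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # candidate cuts lie only between distinct values
        e = (i / n) * entropy(ys[:i]) + ((n - i) / n) * entropy(ys[i:])
        if e < best_e:
            best_t, best_e = (xs[i - 1] + xs[i]) / 2, e
    return best_t

# Hypothetical tf-idf-like weights of one term across six documents,
# with binary category labels.
weights = [0.00, 0.05, 0.10, 0.40, 0.55, 0.70]
labels  = [0,    0,    0,    1,    1,    1]
t = best_split(weights, labels)          # -> 0.25
binary = [int(w > t) for w in weights]   # -> [0, 0, 0, 1, 1, 1]
```

A full discretizer would apply such splits recursively per attribute with a stopping criterion (e.g. MDL), but the single split above captures the core idea of turning a continuous weight into a binary feature.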

Original language: English
Pages (from-to): 320-334
Number of pages: 15
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2633
Publication status: Published - 2003
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Discretizing continuous attributes in AdaBoost for text categorization. / Nardiello, Pio; Sebastiani, Fabrizio; Sperduti, Alessandro.

In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 2633, 2003, p. 320-334.

Research output: Contribution to journal › Article

@article{c287243e863041a58aa8a3ab1142c5c8,
title = "Discretizing continuous attributes in AdaBoost for text categorization",
abstract = "We focus on two recently proposed algorithms in the family of {"}boosting{"}-based learners for automated text classification, ADABOOST.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the {"}weighted{"} representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.",
author = "Pio Nardiello and Fabrizio Sebastiani and Alessandro Sperduti",
year = "2003",
language = "English",
volume = "2633",
pages = "320--334",
journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - Discretizing continuous attributes in AdaBoost for text categorization

AU - Nardiello, Pio

AU - Sebastiani, Fabrizio

AU - Sperduti, Alessandro

PY - 2003

Y1 - 2003

N2 - We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, ADABOOST.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.

UR - http://www.scopus.com/inward/record.url?scp=33644485800&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33644485800&partnerID=8YFLogxK

M3 - Article

VL - 2633

SP - 320

EP - 334

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SN - 0302-9743

ER -