Boosting multi-label hierarchical text categorization

Andrea Esuli, Tiziano Fagni, Fabrizio Sebastiani

Research output: Contribution to journal › Article

44 Citations (Scopus)

Abstract

Hierarchical Text Categorization (HTC) is the task of generating, usually by means of supervised learning algorithms, text classifiers that operate on hierarchically structured classification schemes. Although most large classification schemes for text have a hierarchical structure, text classification researchers have so far focused mostly on algorithms for "flat" classification, i.e. algorithms that operate on non-hierarchical classification schemes. When applied to a hierarchical classification problem, these algorithms cannot take advantage of the information inherent in the class hierarchy, and may thus be suboptimal in terms of efficiency and/or effectiveness. In this paper we propose TreeBoost.MH, a multi-label HTC algorithm that is a hierarchical variant of AdaBoost.MH, a well-known member of the family of "boosting" learning algorithms. TreeBoost.MH embodies several intuitions that had already arisen within HTC, e.g. that both feature selection and the selection of negative training examples should be performed "locally", i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated "locally". TreeBoost.MH realizes all these intuitions in a simple and elegant way: it is defined as a recursive algorithm that uses AdaBoost.MH as its base step and that recurs over the tree structure. We present the results of experiments with TreeBoost.MH on three HTC benchmarks and analytically discuss its computational cost.
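
The recursive, "local" structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Node`, `build_local_problem` and `train_tree` are hypothetical, and `base_learner` is a placeholder for a multi-label learner such as AdaBoost.MH. The sketch assumes one plausible reading of "local" negative selection: at each internal node, the negative examples of a child category are the documents that are positive for one of its siblings but not for the child itself. Because the base learner is invoked once per internal node, any weight distribution it maintains would cover only that node's local examples and child categories.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    """A category in the classification tree (hypothetical helper type)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def subtree_labels(node: Node) -> Set[str]:
    """Names of all categories in the subtree rooted at `node`, itself included."""
    labels = {node.name}
    for child in node.children:
        labels |= subtree_labels(child)
    return labels


# For one internal node: child category -> (positive doc ids, negative doc ids).
LocalProblem = Dict[str, Tuple[Set[str], Set[str]]]


def build_local_problem(node: Node, doc_labels: Dict[str, Set[str]]) -> LocalProblem:
    """Select training examples using only the neighbourhood of `node`."""
    # A document is positive for a child if it carries any label in that child's subtree.
    positives = {
        child.name: {d for d, labs in doc_labels.items() if labs & subtree_labels(child)}
        for child in node.children
    }
    # Local pool: documents that are positive for at least one child of this node.
    pool = set().union(*positives.values())
    # Assumption: negatives of a child are the pooled documents not positive for it.
    return {name: (pos, pool - pos) for name, pos in positives.items()}


def train_tree(
    node: Node,
    doc_labels: Dict[str, Set[str]],
    base_learner: Callable[[LocalProblem], object],
    models: Optional[Dict[str, object]] = None,
) -> Dict[str, object]:
    """Recurse over the tree, calling the base learner once per internal node."""
    if models is None:
        models = {}
    if not node.children:  # a leaf has no children to discriminate among
        return models
    models[node.name] = base_learner(build_local_problem(node, doc_labels))
    for child in node.children:
        train_tree(child, doc_labels, base_learner, models)
    return models


# Toy data, invented for illustration only.
tree = Node("root", [
    Node("sports", [Node("soccer"), Node("tennis")]),
    Node("science", [Node("physics"), Node("biology")]),
])
doc_labels = {
    "d1": {"soccer"},
    "d2": {"tennis"},
    "d3": {"physics"},
    "d4": {"biology", "soccer"},
}
# The identity function stands in for AdaBoost.MH so the local problems can be inspected.
models = train_tree(tree, doc_labels, base_learner=lambda problem: problem)
```

In this toy run, the local problem stored under `models["sports"]` never mentions document `d3`, which belongs only to the `science` subtree. Restricting each node to its own neighbourhood in this way is the kind of saving in training effort that a flat learner, which treats every document as a candidate negative for every category, cannot obtain.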

Original language: English
Pages (from-to): 287-313
Number of pages: 27
Journal: Information Retrieval
Volume: 11
Issue number: 4
DOI: 10.1007/s10791-008-9047-y
Publication status: Published - Aug 2008
Externally published: Yes

Fingerprint

  • Labels
  • Intuition
  • Adaptive boosting
  • Learning algorithms
  • Supervised learning
  • Feature extraction
  • Classifiers
  • Learning
  • Topology
  • Efficiency
  • Costs

Keywords

  • Boosting
  • Hierarchical text classification

ASJC Scopus subject areas

  • Information Systems

Cite this

Boosting multi-label hierarchical text categorization. / Esuli, Andrea; Fagni, Tiziano; Sebastiani, Fabrizio.

In: Information Retrieval, Vol. 11, No. 4, 08.2008, p. 287-313.

Esuli, Andrea; Fagni, Tiziano; Sebastiani, Fabrizio. / Boosting multi-label hierarchical text categorization. In: Information Retrieval. 2008; Vol. 11, No. 4. pp. 287-313.
@article{79041af6911c4ffbb35d61d38100ff62,
title = "Boosting multi-label hierarchical text categorization",
keywords = "Boosting, Hierarchical text classification",
author = "Andrea Esuli and Tiziano Fagni and Fabrizio Sebastiani",
year = "2008",
month = "8",
doi = "10.1007/s10791-008-9047-y",
language = "English",
volume = "11",
pages = "287--313",
journal = "Information Retrieval",
issn = "1386-4564",
publisher = "Springer Netherlands",
number = "4",

}

TY - JOUR

T1 - Boosting multi-label hierarchical text categorization

AU - Esuli, Andrea

AU - Fagni, Tiziano

AU - Sebastiani, Fabrizio

PY - 2008/8

Y1 - 2008/8

KW - Boosting

KW - Hierarchical text classification

UR - http://www.scopus.com/inward/record.url?scp=43949121902&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=43949121902&partnerID=8YFLogxK

U2 - 10.1007/s10791-008-9047-y

DO - 10.1007/s10791-008-9047-y

M3 - Article

AN - SCOPUS:43949121902

VL - 11

SP - 287

EP - 313

JO - Information Retrieval

JF - Information Retrieval

SN - 1386-4564

IS - 4

ER -