Preferential text classification: Learning algorithms and evaluation measures

Fabio Aiolli, Riccardo Cardin, Fabrizio Sebastiani, Alessandro Sperduti

Research output: Contribution to journal (Article)

10 Citations (Scopus)

Abstract

In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c′ is preferred to category c″"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
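
As a rough illustration of what training data "expressed in preferential form" could look like, the Python sketch below turns one document's primary and secondary categories into pairs of the form (preferred category, less preferred category). The function name `preference_pairs` and the specific ordering used (primary over secondary, and both over all remaining categories) are assumptions made for this example; the abstract does not spell out the exact preference sets used in the paper.

```python
from itertools import product


def preference_pairs(primary, secondary, all_categories):
    """Return sorted (preferred, less_preferred) category pairs for one document.

    Assumed ordering (not stated in the abstract): primary categories are
    preferred to secondary ones, and both are preferred to every category
    that is not assigned to the document at all.
    """
    primary, secondary = set(primary), set(secondary)
    others = set(all_categories) - primary - secondary

    pairs = set(product(primary, secondary))   # primary over secondary
    pairs |= set(product(primary, others))     # primary over unassigned
    pairs |= set(product(secondary, others))   # secondary over unassigned
    return sorted(pairs)


# Hypothetical document with one primary category "A", one secondary
# category "B", and a category set {"A", "B", "C"}.
pairs = preference_pairs(["A"], ["B"], ["A", "B", "C"])
for preferred, other in pairs:
    print(f"{preferred} is preferred to {other}")
```

Each pair could then be treated as a constraint by a preference learner, which is how, under these assumptions, primary and secondary categories would affect training differently.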

Original language: English
Pages (from-to): 559-580
Number of pages: 22
Journal: Information Retrieval
Volume: 12
Issue number: 5
DOI: 10.1007/s10791-008-9071-y
Publication status: Published - 2009
Externally published: Yes

Keywords

  • Preferential learning
  • Primary and secondary categories
  • Supervised learning
  • Text categorization
  • Text classification

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

Preferential text classification: Learning algorithms and evaluation measures. / Aiolli, Fabio; Cardin, Riccardo; Sebastiani, Fabrizio; Sperduti, Alessandro.

In: Information Retrieval, Vol. 12, No. 5, 2009, p. 559-580.

@article{0751dbd838ad4689a3452266896800a3,
title = "Preferential text classification: Learning algorithms and evaluation measures",
abstract = "In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form {"}for document di, category c′ is preferred to category c″{"}; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.",
keywords = "Preferential learning, Primary and secondary categories, Supervised learning, Text categorization, Text classification",
author = "Fabio Aiolli and Riccardo Cardin and Fabrizio Sebastiani and Alessandro Sperduti",
year = "2009",
doi = "10.1007/s10791-008-9071-y",
language = "English",
volume = "12",
pages = "559--580",
journal = "Information Retrieval",
issn = "1386-4564",
publisher = "Springer Netherlands",
number = "5",

}

TY - JOUR
T1 - Preferential text classification
T2 - Learning algorithms and evaluation measures
AU - Aiolli, Fabio
AU - Cardin, Riccardo
AU - Sebastiani, Fabrizio
AU - Sperduti, Alessandro
PY - 2009
Y1 - 2009
AB - In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document di, category c′ is preferred to category c″"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.

KW - Preferential learning
KW - Primary and secondary categories
KW - Supervised learning
KW - Text categorization
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=84873337014&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84873337014&partnerID=8YFLogxK
U2 - 10.1007/s10791-008-9071-y
DO - 10.1007/s10791-008-9071-y
M3 - Article
AN - SCOPUS:84873337014
VL - 12
SP - 559
EP - 580
JO - Information Retrieval
JF - Information Retrieval
SN - 1386-4564
IS - 5
ER -