Preferential text classification: Learning algorithms and evaluation measures

Fabio Aiolli, Riccardo Cardin, Fabrizio Sebastiani, Alessandro Sperduti

Research output: Contribution to journal › Article

10 Citations (Scopus)


In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single-label and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods is used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c′ is preferred to category c″"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
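The preferential form mentioned in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it only shows one plausible way to turn a document's labels into statements of the form "category c′ is preferred to category c″". The function name `preference_pairs` and the three-tier ordering (primary over secondary over all remaining categories) are assumptions made for this illustration.

```python
from itertools import product


def preference_pairs(primary, secondary, all_categories):
    """Return sorted (preferred, less_preferred) pairs for one document.

    Assumed ordering (hypothetical, for illustration only):
    primary categories > secondary categories > all other categories.
    """
    primary = set(primary)
    # A category listed as both primary and secondary is treated as primary.
    secondary = set(secondary) - primary
    others = set(all_categories) - primary - secondary

    pairs = set()
    pairs.update(product(primary, secondary))  # primary preferred to secondary
    pairs.update(product(primary, others))     # primary preferred to non-categories
    pairs.update(product(secondary, others))   # secondary preferred to non-categories
    return sorted(pairs)


# One document with primary category "A", secondary category "B",
# drawn from a category set {"A", "B", "C"}.
example = preference_pairs(["A"], ["B"], ["A", "B", "C"])
print(example)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Each pair could then serve as one training constraint for a preference learner, which is how the primary/secondary distinction would reach the learning phase rather than only the classification phase.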

Original language: English
Pages (from-to): 559-580
Number of pages: 22
Journal: Information Retrieval
Issue number: 5
Publication status: Published - 1 Jan 2009



Keywords

  • Preferential learning
  • Primary and secondary categories
  • Supervised learning
  • Text categorization
  • Text classification

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
