Utility-theoretic ranking for semiautomated text classification

Giacomo Berardi, Andrea Esuli, Fabrizio Sebastiani

Research output: Contribution to journal › Article

5 Citations (Scopus)


Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
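The contrast between the two ranking strategies can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not state how validation gain is computed, so the sketch assumes a simple stand-in in which the expected gain of validating a document is the estimated probability that its label is wrong, multiplied by a hypothetical payoff for correcting that kind of error. The names `LabelledDoc`, `gain_fp`, and `gain_fn`, and all numeric values, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledDoc:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier
    confidence: float         # estimated probability that the label is correct


def rank_by_low_confidence(docs):
    """Baseline: least confident documents first."""
    return sorted(docs, key=lambda d: d.confidence)


def expected_gain(doc, gain_fp, gain_fn):
    """Assumed stand-in for validation gain (not the paper's definition).

    P(label is wrong) times the hypothetical payoff of correcting
    a false positive (gain_fp) or a false negative (gain_fn).
    """
    p_error = 1.0 - doc.confidence
    payoff = gain_fp if doc.predicted_positive else gain_fn
    return p_error * payoff


def rank_by_expected_gain(docs, gain_fp, gain_fn):
    """Utility-style ranking: highest expected gain first."""
    return sorted(
        docs,
        key=lambda d: expected_gain(d, gain_fp, gain_fn),
        reverse=True,
    )


# Toy data; confidences and payoffs are made up for illustration.
docs = [
    LabelledDoc("a", True, 0.55),
    LabelledDoc("b", False, 0.70),
    LabelledDoc("c", True, 0.90),
]

baseline = [d.doc_id for d in rank_by_low_confidence(docs)]
utility = [d.doc_id for d in rank_by_expected_gain(docs, gain_fp=1.0, gain_fn=3.0)]
print(baseline, utility)
```

With these assumed payoffs, document "b" moves ahead of the less confident document "a", because correcting a likely false negative is taken to be worth more than correcting a likely false positive. This is one way a confidence-only ranking can fail to maximize the expected improvement.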

Original language: English
Article number: 6
Journal: ACM Transactions on Knowledge Discovery from Data
Issue number: 1
Publication status: Published - 1 Jul 2015



  • Algorithm
  • Cost-sensitive learning
  • Design
  • Experimentation
  • H. [Information retrieval]: Information systems
  • I. [Machine learning]: Computing methodologies
  • Learning paradigms-supervised learning
  • Measurements
  • Ranking
  • Retrieval tasks and goals-clustering and classification
  • Semiautomated text classification
  • Supervised learning
  • Text classification

ASJC Scopus subject areas

  • Computer Science (all)
