Utility-theoretic ranking for semiautomated text classification

Giacomo Berardi, Andrea Esuli, Fabrizio Sebastiani

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
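
The contrast between the two ranking strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes F1 as the effectiveness measure, assumes that estimates of the contingency counts (tp, fp, fn) are available, and uses a hypothetical `Doc` record and hypothetical function names. The baseline sorts by ascending confidence, while the utility-based ranking weighs the estimated probability of a labelling error by the F1 gain that correcting that kind of error would bring.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    predicted_positive: bool  # label assigned by the classifier
    p_correct: float          # estimated probability that this label is right


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from contingency counts (assumed effectiveness measure)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def rank_by_low_confidence(docs):
    """Baseline: least confidently labelled documents first."""
    return sorted(docs, key=lambda d: d.p_correct)


def rank_by_expected_gain(docs, tp: int, fp: int, fn: int):
    """Rank by P(label is wrong) times the F1 gain of correcting it.

    tp, fp, fn are assumed estimates of the current contingency counts,
    with fp >= 1 and fn >= 1.
    """
    base = f1(tp, fp, fn)
    gain_fp = f1(tp, fp - 1, fn) - base      # a false positive becomes a true negative
    gain_fn = f1(tp + 1, fp, fn - 1) - base  # a false negative becomes a true positive

    def utility(d: Doc) -> float:
        gain = gain_fp if d.predicted_positive else gain_fn
        return (1.0 - d.p_correct) * gain

    return sorted(docs, key=utility, reverse=True)


docs = [
    Doc("a", predicted_positive=True, p_correct=0.55),
    Doc("b", predicted_positive=False, p_correct=0.70),
    Doc("c", predicted_positive=True, p_correct=0.90),
]
print([d.name for d in rank_by_low_confidence(docs)])           # ['a', 'b', 'c']
print([d.name for d in rank_by_expected_gain(docs, 10, 5, 5)])  # ['b', 'a', 'c']
```

With these illustrative counts, correcting a false negative raises F1 more than correcting a false positive, so document b overtakes the less confidently labelled document a. This is one way in which a confidence-only ranking can be suboptimal.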

Original language: English
Article number: 6
Journal: ACM Transactions on Knowledge Discovery from Data
Volume: 10
Issue number: 1
DOI: 10.1145/2742548
Publication status: Published - 1 Jul 2015

Fingerprint

Labeling
Classifiers
Experiments

Keywords

  • Algorithm
  • Cost-sensitive learning
  • Design
  • Experimentation
  • H. [Information retrieval]: Information systems
  • I. [Machine learning]: Computing methodologies
  • Learning paradigms-supervised learning
  • Measurements
  • Ranking
  • Retrieval tasks and goals-clustering and classification
  • Semiautomated text classification
  • Supervised learning
  • Text classification

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Utility-theoretic ranking for semiautomated text classification. / Berardi, Giacomo; Esuli, Andrea; Sebastiani, Fabrizio.

In: ACM Transactions on Knowledge Discovery from Data, Vol. 10, No. 1, 6, 01.07.2015.

@article{0b680f6fa5ad49969a04122cfd058f96,
title = "Utility-theoretic ranking for semiautomated text classification",
abstract = "Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.",
keywords = "Algorithm, Cost-sensitive learning, Design, Experimentation, H. [Information retrieval]: Information systems, I. [Machine learning]: Computing methodologies, Learning paradigms-supervised learning, Measurements, Ranking, Retrieval tasks and goals-clustering and classification, Semiautomated text classification, Supervised learning, Text classification",
author = "Giacomo Berardi and Andrea Esuli and Fabrizio Sebastiani",
year = "2015",
month = "7",
day = "1",
doi = "10.1145/2742548",
language = "English",
volume = "10",
journal = "ACM Transactions on Knowledge Discovery from Data",
issn = "1556-4681",
publisher = "Association for Computing Machinery (ACM)",
number = "1",
}

TY - JOUR

T1 - Utility-theoretic ranking for semiautomated text classification

AU - Berardi, Giacomo

AU - Esuli, Andrea

AU - Sebastiani, Fabrizio

PY - 2015/7/1

Y1 - 2015/7/1

N2 - Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.

KW - Algorithm

KW - Cost-sensitive learning

KW - Design

KW - Experimentation

KW - H. [Information retrieval]: Information systems

KW - I. [Machine learning]: Computing methodologies

KW - Learning paradigms-supervised learning

KW - Measurements

KW - Ranking

KW - Retrieval tasks and goals-clustering and classification

KW - Semiautomated text classification

KW - Supervised learning

KW - Text classification

UR - http://www.scopus.com/inward/record.url?scp=84938385374&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84938385374&partnerID=8YFLogxK

U2 - 10.1145/2742548

DO - 10.1145/2742548

M3 - Article

VL - 10

JO - ACM Transactions on Knowledge Discovery from Data

JF - ACM Transactions on Knowledge Discovery from Data

SN - 1556-4681

IS - 1

M1 - 6

ER -