A utility-theoretic ranking method for semi-automated text classification

Giacomo Berardi, Andrea Esuli, Fabrizio Sebastiani

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

11 Citations (Scopus)

Abstract

In Semi-Automated Text Classification (SATC), an automatic classifier F labels a set of unlabelled documents D, following which a human annotator inspects (and corrects when appropriate) the labels attributed by F to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents in a way that maximizes the expected increase in effectiveness that derives from inspecting D. An obvious strategy is to rank D so that the documents that F has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method above, and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.
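The contrast between the two ranking strategies can be illustrated with a minimal sketch. It is not the authors' formulation: it assumes a binary task, assumes each document carries a calibrated probability `p_pos` of being positive, uses F1 as the effectiveness measure, and estimates the contingency table from those probabilities alone. The names `Doc`, `rank_by_low_confidence`, and `rank_by_expected_gain` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    p_pos: float  # assumed: calibrated probability that the document is positive


def predicted_positive(doc: Doc) -> bool:
    return doc.p_pos >= 0.5


def f1(tp: float, fp: float, fn: float) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0


def rank_by_low_confidence(docs):
    """Baseline: documents with the least confident automatic label first."""
    return sorted(docs, key=lambda d: abs(d.p_pos - 0.5))


def rank_by_expected_gain(docs):
    """Illustrative utility ranking: P(label is wrong) * F1 gain from fixing it."""
    # Expected contingency-table cells, estimated from the probabilities alone.
    tp = sum(d.p_pos for d in docs if predicted_positive(d))
    fp = sum(1 - d.p_pos for d in docs if predicted_positive(d))
    fn = sum(d.p_pos for d in docs if not predicted_positive(d))
    base = f1(tp, fp, fn)
    # Correcting a false positive turns it into a true negative.
    # The clamp guards against negative cells when expected counts are below 1.
    gain_fp = f1(tp, max(fp - 1, 0.0), fn) - base
    # Correcting a false negative turns it into a true positive.
    gain_fn = f1(tp + 1, fp, max(fn - 1, 0.0)) - base

    def utility(d: Doc) -> float:
        if predicted_positive(d):
            return (1 - d.p_pos) * gain_fp
        return d.p_pos * gain_fn

    return sorted(docs, key=utility, reverse=True)


docs = [Doc("d1", 0.6), Doc("d2", 0.3), Doc("d3", 0.95),
        Doc("d4", 0.05), Doc("d5", 0.05), Doc("d6", 0.05)]
print([d.doc_id for d in rank_by_low_confidence(docs)])
print([d.doc_id for d in rank_by_expected_gain(docs)])
```

On this toy input the baseline puts d1 (the least confident label) first, while the gain-based ranking puts d2 first: d2 is less likely to be wrong, but correcting a false negative raises the estimated F1 more than correcting a false positive does here.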

Original language: English
Title of host publication: SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 961-970
Number of pages: 10
DOI: https://doi.org/10.1145/2348283.2348411
Publication status: Published - 2012
Externally published: Yes
Event: 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012 - Portland, OR, United States
Duration: 12 Aug 2012 to 16 Aug 2012

Keywords

  • cost-sensitive learning
  • ranking
  • semi-automated text classification
  • supervised learning
  • text classification

ASJC Scopus subject areas

  • Information Systems

Cite this

Berardi, G., Esuli, A., & Sebastiani, F. (2012). A utility-theoretic ranking method for semi-automated text classification. In SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 961-970). https://doi.org/10.1145/2348283.2348411

@inproceedings{8c6ebae70a9c4e148202aa3bb7819094,
title = "A utility-theoretic ranking method for semi-automated text classification",
keywords = "cost-sensitive learning, ranking, semi-automated text classification, supervised learning, text classification",
author = "Giacomo Berardi and Andrea Esuli and Fabrizio Sebastiani",
year = "2012",
doi = "10.1145/2348283.2348411",
language = "English",
isbn = "9781450316583",
pages = "961--970",
booktitle = "SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval",

}

TY - GEN
T1 - A utility-theoretic ranking method for semi-automated text classification
AU - Berardi, Giacomo
AU - Esuli, Andrea
AU - Sebastiani, Fabrizio
PY - 2012
Y1 - 2012
KW - cost-sensitive learning
KW - ranking
KW - semi-automated text classification
KW - supervised learning
KW - text classification
UR - http://www.scopus.com/inward/record.url?scp=84866594882&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866594882&partnerID=8YFLogxK
U2 - 10.1145/2348283.2348411
DO - 10.1145/2348283.2348411
M3 - Conference contribution
AN - SCOPUS:84866594882
SN - 9781450316583
SP - 961
EP - 970
BT - SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval
ER -