Optimizing text quantifiers for multivariate loss functions

Andrea Esuli, Fabrizio Sebastiani

Research output: Contribution to journal › Article

16 Citations (Scopus)

Abstract

We address the problem of quantification, a supervised learning task whose goal is, given a class, to estimate the relative frequency (or prevalence) of the class in a dataset of unlabeled items. Quantification has several applications in data and text mining, such as estimating the prevalence of positive reviews in a set of reviews of a given product or estimating the prevalence of a given support issue in a dataset of transcripts of phone calls to tech support. So far, quantification has been addressed by learning a general-purpose classifier, counting the unlabeled items that have been assigned the class, and tuning the obtained counts according to some heuristics. In this article, we depart from the tradition of using general-purpose classifiers and use instead a supervised learning model for structured prediction, capable of generating classifiers directly optimized for the (multivariate and nonlinear) function used for evaluating quantification accuracy. The experiments that we have run on 5,500 binary high-dimensional datasets (averaging more than 14,000 documents each) show that this method is more accurate, more stable, and more efficient than existing state-of-the-art quantification methods.
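The baseline pipeline the abstract describes (learn a classifier, count the items assigned the class, then adjust the counts heuristically) can be sketched in a few lines of Python. This is an illustrative sketch of the classify-and-count baseline and its standard adjusted variant, not the paper's structured-prediction method; the function names and the true/false-positive-rate correction are assumptions for illustration, and the smoothed Kullback-Leibler divergence is the usual quantification loss the keywords refer to.

```python
import math

def classify_and_count(predictions):
    """Baseline quantifier: estimated prevalence is simply the
    fraction of items the classifier assigned to the class."""
    return sum(predictions) / len(predictions)

def adjusted_count(predictions, tpr, fpr):
    """Adjusted classify-and-count: correct the raw count using the
    classifier's true- and false-positive rates (estimated, e.g.,
    on held-out data), inverting cc = tpr*p + fpr*(1 - p)."""
    cc = sum(predictions) / len(predictions)
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid prevalence

def kld(p, p_hat, eps=1e-9):
    """Kullback-Leibler divergence between true and estimated binary
    prevalences (lower is better); eps guards against log(0)."""
    p = min(max(p, eps), 1 - eps)
    p_hat = min(max(p_hat, eps), 1 - eps)
    return p * math.log(p / p_hat) + (1 - p) * math.log((1 - p) / (1 - p_hat))
```

For example, on binary predictions `[1, 0, 1, 1, 0]`, classify-and-count estimates a prevalence of 0.6, and the adjusted variant shifts that estimate according to the assumed error rates; the KLD then scores how far an estimate is from the true prevalence.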

Original language: English
Journal: ACM Transactions on Knowledge Discovery from Data
Volume: 9
Issue number: 4
DOI: 10.1145/2700406
Publication status: Published - 1 Jun 2015


Keywords

  • Kullback-Leibler divergence
  • Loss functions
  • Prevalence estimation
  • Prior estimation
  • Quantification
  • Supervised learning
  • Text classification

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Optimizing text quantifiers for multivariate loss functions. / Esuli, Andrea; Sebastiani, Fabrizio.

In: ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 4, 01.06.2015.


@article{8820f2330f9d43ed8d4c0b242b675c60,
title = "Optimizing text quantifiers for multivariate loss functions",
abstract = "We address the problem of quantification, a supervised learning task whose goal is, given a class, to estimate the relative frequency (or prevalence) of the class in a dataset of unlabeled items. Quantification has several applications in data and text mining, such as estimating the prevalence of positive reviews in a set of reviews of a given product or estimating the prevalence of a given support issue in a dataset of transcripts of phone calls to tech support. So far, quantification has been addressed by learning a general-purpose classifier, counting the unlabeled items that have been assigned the class, and tuning the obtained counts according to some heuristics. In this article, we depart from the tradition of using general-purpose classifiers and use instead a supervised learning model for structured prediction, capable of generating classifiers directly optimized for the (multivariate and nonlinear) function used for evaluating quantification accuracy. The experiments that we have run on 5,500 binary high-dimensional datasets (averaging more than 14,000 documents each) show that this method is more accurate, more stable, and more efficient than existing state-of-the-art quantification methods.",
keywords = "Kullback-Leibler divergence, Loss functions, Prevalence estimation, Prior estimation, Quantification, Supervised learning, Text classification",
author = "Andrea Esuli and Fabrizio Sebastiani",
year = "2015",
month = "6",
day = "1",
doi = "10.1145/2700406",
language = "English",
volume = "9",
journal = "ACM Transactions on Knowledge Discovery from Data",
issn = "1556-4681",
publisher = "Association for Computing Machinery (ACM)",
number = "4",
}