Distributional random oversampling for imbalanced text classification

Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
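The paper's DRO method itself is not detailed in this abstract. For illustration only, the sketch below shows plain random oversampling, the simpler baseline that the abstract's definition of oversampling ("generating synthetic training examples of the minority class") describes: minority-class rows are duplicated at random until the classes are balanced. The function name and the toy data are hypothetical; DRO instead generates *new* documents by sampling from the terms' distributional profiles across the collection.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class (y == 1) rows until
    both classes have equal counts. A baseline illustration, not the
    paper's DRO method."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)   # minority: positive examples
    neg = np.flatnonzero(y == 0)   # majority: negative examples
    n_extra = len(neg) - len(pos)  # how many duplicates are needed
    extra = rng.choice(pos, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Toy collection: 6 documents as feature vectors, 2 positives.
X = np.arange(12, dtype=float).reshape(6, 2)
y = [1, 0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

After the call, the training set contains as many positive as negative examples, at the cost of exact duplicates; DRO avoids this by randomizing the synthetic documents' term vectors.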

Original language: English
Title of host publication: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 805-808
Number of pages: 4
ISBN (Electronic): 9781450342902
DOIs: 10.1145/2911451.2914722
Publication status: Published - 7 Jul 2016
Externally published: Yes
Event: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy
Duration: 17 Jul 2016 - 21 Jul 2016

Other

Other: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
Country: Italy
City: Pisa
Period: 17/7/16 - 21/7/16

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Moreo, A., Esuli, A., & Sebastiani, F. (2016). Distributional random oversampling for imbalanced text classification. In SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 805-808). Association for Computing Machinery, Inc. https://doi.org/10.1145/2911451.2914722

Distributional random oversampling for imbalanced text classification. / Moreo, Alejandro; Esuli, Andrea; Sebastiani, Fabrizio.

SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, 2016. p. 805-808.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Moreo, A, Esuli, A & Sebastiani, F 2016, Distributional random oversampling for imbalanced text classification. in SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, pp. 805-808, 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, 17/7/16. https://doi.org/10.1145/2911451.2914722
Moreo A, Esuli A, Sebastiani F. Distributional random oversampling for imbalanced text classification. In SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc. 2016. p. 805-808 https://doi.org/10.1145/2911451.2914722
Moreo, Alejandro ; Esuli, Andrea ; Sebastiani, Fabrizio. / Distributional random oversampling for imbalanced text classification. SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc, 2016. pp. 805-808
@inproceedings{38467a3108814100a00e2a3692f0909b,
title = "Distributional random oversampling for imbalanced text classification",
abstract = "The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.",
author = "Alejandro Moreo and Andrea Esuli and Fabrizio Sebastiani",
year = "2016",
month = "7",
day = "7",
doi = "10.1145/2911451.2914722",
language = "English",
pages = "805--808",
booktitle = "SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval",
publisher = "Association for Computing Machinery, Inc",

}

TY - GEN

T1 - Distributional random oversampling for imbalanced text classification

AU - Moreo, Alejandro

AU - Esuli, Andrea

AU - Sebastiani, Fabrizio

PY - 2016/7/7

Y1 - 2016/7/7

N2 - The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.

AB - The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.

UR - http://www.scopus.com/inward/record.url?scp=84980410173&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84980410173&partnerID=8YFLogxK

U2 - 10.1145/2911451.2914722

DO - 10.1145/2911451.2914722

M3 - Conference contribution

AN - SCOPUS:84980410173

SP - 805

EP - 808

BT - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval

PB - Association for Computing Machinery, Inc

ER -