Supervised term weighting for automated text categorization

Franca Debole, Fabrizio Sebastiani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

206 Citations (Scopus)

Abstract

The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e. that information on the membership of training documents in categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of "supervised variants" of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.
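
The substitution described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`chi_square`, `idf`, `weigh`), the toy corpus, the `1 + log(tf)` damping, and the absence of length normalization are all assumptions made for illustration. Only the core idea, replacing the idf factor with the term selection function (here chi-square, computed for a single category), comes from the abstract.

```python
import math
from collections import Counter


def chi_square(docs, labels, term):
    """Chi-square statistic between presence of `term` and a binary category label."""
    a = b = c = d = 0
    for tokens, in_category in zip(docs, labels):
        present = term in tokens
        if present and in_category:
            a += 1  # term present, document in category
        elif present:
            b += 1  # term present, document not in category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document not in category
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return len(docs) * (a * d - b * c) ** 2 / denom if denom else 0.0


def idf(docs, term):
    """Unsupervised inverse document frequency (ignores category labels)."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def weigh(doc, docs, labels, supervised=True):
    """Weight each term of `doc` as damped tf times either chi-square or idf."""
    if supervised:
        score = lambda t: chi_square(docs, labels, t)
    else:
        score = lambda t: idf(docs, t)
    # Assumed tf damping: 1 + log(tf); no cosine normalization is applied.
    return {t: (1 + math.log(f)) * score(t) for t, f in Counter(doc).items()}


# Hypothetical toy corpus: tokenized documents and membership in one category.
docs = [["wheat", "price"], ["wheat", "export"], ["oil", "price"], ["oil", "export"]]
labels = [True, True, False, False]

supervised_weights = weigh(docs[0], docs, labels, supervised=True)
unsupervised_weights = weigh(docs[0], docs, labels, supervised=False)
```

In this toy corpus "wheat" and "price" each occur in half of the documents, so idf weights them equally. The chi-square variant gives "wheat" a weight of 4.0 and "price" a weight of 0.0, because only "wheat" is associated with the category.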

Original language: English
Title of host publication: Proceedings of the ACM Symposium on Applied Computing
Editors: G. Lamont
Pages: 784-788
Number of pages: 5
Publication status: Published - 2003
Externally published: Yes
Event: Proceedings of the 2003 ACM Symposium on Applied Computing - Melbourne, FL
Duration: 9 Mar 2003 to 12 Mar 2003

Other

Other: Proceedings of the 2003 ACM Symposium on Applied Computing
City: Melbourne, FL
Period: 9/3/03 to 12/3/03

Fingerprint

Classifiers
Supervised learning
Support vector machines

Keywords

  • Machine learning
  • Text categorization
  • Text classification

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Debole, F., & Sebastiani, F. (2003). Supervised term weighting for automated text categorization. In G. Lamont (Ed.), Proceedings of the ACM Symposium on Applied Computing (pp. 784-788)

@inproceedings{96b733f42a664b87b6fb0af8c507e22c,
title = "Supervised term weighting for automated text categorization",
abstract = "The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e. that information on the membership of training documents in categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of {"}supervised variants{"} of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.",
keywords = "Machine learning, Text categorization, Text classification",
author = "Franca Debole and Fabrizio Sebastiani",
year = "2003",
language = "English",
pages = "784--788",
editor = "G. Lamont",
booktitle = "Proceedings of the ACM Symposium on Applied Computing",

}

TY - GEN

T1 - Supervised term weighting for automated text categorization

AU - Debole, Franca

AU - Sebastiani, Fabrizio

PY - 2003

Y1 - 2003

AB - The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e. that information on the membership of training documents in categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of "supervised variants" of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.

KW - Machine learning

KW - Text categorization

KW - Text classification

UR - http://www.scopus.com/inward/record.url?scp=0037998887&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0037998887&partnerID=8YFLogxK

M3 - Conference contribution

SP - 784

EP - 788

BT - Proceedings of the ACM Symposium on Applied Computing

A2 - Lamont, G.

ER -