Machine Learning in Automated Text Categorization

Fabrizio Sebastiani

Research output: Contribution to journal › Article

4769 Citations (Scopus)

Abstract

The automated categorization (or classification) of texts into predefined categories has attracted booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community, the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (which consists of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems: document representation, classifier construction, and classifier evaluation.
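
The three problems named in the abstract can be illustrated with a minimal sketch. It is not taken from the survey: the bag-of-words representation, the choice of a multinomial naive Bayes learner with Laplace smoothing, the function names, and the toy documents are all assumptions made here for illustration, and the survey itself covers many alternative representations and learners.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: a lowercase bag of words (deliberately crude)."""
    return text.lower().split()


def train(labeled_docs):
    """Classifier construction: learn word statistics from preclassified documents."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of training documents
    vocab = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the category with the highest Laplace-smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(doc_counts):
        score = math.log(doc_counts[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[category][token] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def precision_recall(model, test_docs, category):
    """Classifier evaluation: precision and recall for a single category."""
    tp = fp = fn = 0
    for text, truth in test_docs:
        predicted = classify(model, text)
        if predicted == category and truth == category:
            tp += 1
        elif predicted == category:
            fp += 1
        elif truth == category:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy data, for illustration only.
TRAIN = [
    ("stocks fell as markets closed lower", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sport"),
    ("the striker scored a late goal", "sport"),
]
TEST = [
    ("interest rates and markets", "finance"),
    ("a goal in the final match", "sport"),
]

model = train(TRAIN)
for text, truth in TEST:
    print(text, "->", classify(model, text), "(expected:", truth + ")")
print("finance precision/recall:", precision_recall(model, TEST, "finance"))
```

Porting this sketch to another domain only requires a different set of preclassified documents, which is the portability advantage the abstract attributes to the machine learning approach.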

Original language: English
Pages (from-to): 1-47
Number of pages: 47
Journal: ACM Computing Surveys
Volume: 34
Issue number: 1
DOI: 10.1145/505282.505283
Publication status: Published - Mar 2002
Externally published: Yes

Fingerprint

Text Categorization
Learning systems
Machine Learning
Classifiers
Knowledge Engineering
Portability
Categorization
Availability
Paradigm
Personnel
Evaluation

Keywords

  • Machine learning
  • Text categorization
  • Text classification

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computational Theory and Mathematics

Cite this

Machine Learning in Automated Text Categorization. / Sebastiani, Fabrizio.

In: ACM Computing Surveys, Vol. 34, No. 1, 03.2002, p. 1-47.

Sebastiani, Fabrizio. / Machine Learning in Automated Text Categorization. In: ACM Computing Surveys. 2002 ; Vol. 34, No. 1. pp. 1-47.
@article{6dc1d9e4fb094e4796d8347dfdb58d2f,
title = "Machine Learning in Automated Text Categorization",
abstract = "The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.",
keywords = "Machine learning, Text categorization, Text classification",
author = "Fabrizio Sebastiani",
year = "2002",
month = "3",
doi = "10.1145/505282.505283",
language = "English",
volume = "34",
pages = "1--47",
journal = "ACM Computing Surveys",
issn = "0360-0300",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - Machine Learning in Automated Text Categorization

AU - Sebastiani, Fabrizio

PY - 2002/3

Y1 - 2002/3

AB - The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

KW - Machine learning

KW - Text categorization

KW - Text classification

UR - http://www.scopus.com/inward/record.url?scp=0002442796&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0002442796&partnerID=8YFLogxK

U2 - 10.1145/505282.505283

DO - 10.1145/505282.505283

M3 - Article

VL - 34

SP - 1

EP - 47

JO - ACM Computing Surveys

JF - ACM Computing Surveys

SN - 0360-0300

IS - 1

ER -