Building thematic lexical resources by term categorization

Alberto Lavelli, Bernardo Magnini, Fabrizio Sebastiani

Research output: Contribution to journal › Article

Abstract

We discuss the automatic generation of thematic lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and themes (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each $c_i$ in a set $C = \{c_1, \ldots, c_m\}$ of themes, a sequence $L_0^i \subseteq L_1^i \subseteq \ldots \subseteq L_n^i$ of lexicons, bootstrapping from an initial lexicon $L_0^i$ and a set of text corpora $\Theta = \{\theta_0, \ldots, \theta_{n-1}\}$ given as input. The method is inspired by text categorization, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, term categorization deals (dually) with terms represented as vectors in a space of documents, and labels terms (instead of documents) with themes. As a learning device we adopt boosting, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of "data cleaning", thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps.
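The dual representation and the nested sequence of lexicons described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a cosine-similarity-to-centroid rule stands in for the boosting learner, and the function names (`term_vectors`, `expand_lexicon`), the stop-word list, the threshold, and the toy corpora are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

STOP_WORDS = {"the", "a"}  # assumed toy stop-word list


def term_vectors(corpus):
    """Represent each term as a vector over documents (the dual of a document vector)."""
    vectors = {}
    for doc_index, text in enumerate(corpus):
        counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
        for term, count in counts.items():
            vectors.setdefault(term, [0] * len(corpus))[doc_index] = count
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_lexicon(lexicon, corpus, threshold=0.7):
    """One step from L_k to L_(k+1): add terms that occur in the same documents as the lexicon.

    Stand-in for the learner: the paper uses boosting; this uses similarity to the
    centroid of the known lexicon terms, which is an assumption of this sketch.
    """
    vectors = term_vectors(corpus)
    seeds = [vectors[t] for t in lexicon if t in vectors]
    if not seeds:
        return set(lexicon)
    centroid = [sum(column) / len(seeds) for column in zip(*seeds)]
    candidates = {t for t, v in vectors.items() if cosine(v, centroid) >= threshold}
    return set(lexicon) | candidates  # union keeps the sequence nested


# Two invented toy corpora playing the role of theta_0 and theta_1.
corpora = [
    ["the striker scored a goal", "the goalkeeper saved the goal",
     "parliament passed the budget"],
    ["the goalkeeper faced a penalty", "the referee awarded a penalty",
     "the minister presented the budget"],
]

lexicon = {"goal"}   # initial lexicon L_0 for a hypothetical "sport" theme
history = [lexicon]  # will hold L_0, L_1, L_2
for corpus in corpora:
    lexicon = expand_lexicon(lexicon, corpus)
    history.append(lexicon)
```

Because each step takes the union with the previous lexicon, the sequence stays nested by construction. The sketch also admits loosely related terms such as "faced", which is the kind of noise the "data cleaning" step mentioned in the abstract is meant to address.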

Original language: English
Pages (from-to): 415-416
Number of pages: 2
Journal: SIGIR Forum (ACM Special Interest Group on Information Retrieval)
Publication status: Published - 2002
Externally published: Yes

Fingerprint

Labels
Information retrieval
Labeling
Learning systems
Cleaning
Resources
Text categorization
Boosting
Data cleaning
Machine learning
Bootstrapping
Language

ASJC Scopus subject areas

  • Management Information Systems
  • Hardware and Architecture

Cite this

Building thematic lexical resources by term categorization. / Lavelli, Alberto; Magnini, Bernardo; Sebastiani, Fabrizio.

In: SIGIR Forum (ACM Special Interest Group on Information Retrieval), 2002, p. 415-416.

@article{fa8e266a5fa94986a72a865b75f86ee5,
title = "Building thematic lexical resources by term categorization",
author = "Alberto Lavelli and Bernardo Magnini and Fabrizio Sebastiani",
year = "2002",
language = "English",
pages = "415--416",
journal = "SIGIR Forum (ACM Special Interest Group on Information Retrieval)",
issn = "0163-5840",
publisher = "Association for Computing Machinery (ACM)",

}

TY - JOUR

T1 - Building thematic lexical resources by term categorization

AU - Lavelli, Alberto

AU - Magnini, Bernardo

AU - Sebastiani, Fabrizio

PY - 2002

Y1 - 2002

UR - http://www.scopus.com/inward/record.url?scp=0036993122&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036993122&partnerID=8YFLogxK

M3 - Article

SP - 415

EP - 416

JO - SIGIR Forum (ACM Special Interest Group on Information Retrieval)

JF - SIGIR Forum (ACM Special Interest Group on Information Retrieval)

SN - 0163-5840

ER -