Lightweight random indexing for polylingual text classification

Alejandro Moreo Fernández, Andrea Esuli, Fabrizio Sebastiani

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Multilingual Text Classification (MLTC) is a text classification task in which each document is written in one of a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the L monolingual classifiers by also leveraging the training documents written in the other (L - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly L times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing, LRI). By running experiments on two well-known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform, both in terms of effectiveness and efficiency, a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
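The core idea behind Random Indexing, as invoked in the abstract, can be sketched as follows. This is a generic illustration of the standard RI technique, not the paper's LRI configuration: the dimensionality (1000), the number of non-zero entries (10), and the helper names are illustrative assumptions.

```python
import zlib
import numpy as np

def index_vector(term, dim=1000, nonzeros=10):
    """Sparse random index vector for a term: a handful of +1/-1
    entries at random positions, zeros elsewhere.  Seeding the RNG
    with a stable hash of the term makes the vector reproducible
    without storing a vocabulary."""
    rng = np.random.default_rng(zlib.crc32(term.encode("utf-8")))
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=nonzeros, replace=False)
    vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
    return vec

def document_vector(tokens, dim=1000):
    """A document is the sum of its terms' index vectors; documents
    in any language thus share the same dim-dimensional space, whose
    size is fixed in advance and independent of the size of the
    juxtaposed multilingual vocabulary."""
    vec = np.zeros(dim)
    for t in tokens:
        vec += index_vector(t, dim=dim)
    return vec
```

The point of the sketch is the dimensionality argument from the abstract: every document, whatever its language, is mapped into the same fixed-size space, so adding languages does not multiply the dimensionality of the representation the way juxtaposing monolingual vector spaces does.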

Original language: English
Pages (from-to): 151-185
Number of pages: 35
Journal: Journal of Artificial Intelligence Research
Volume: 57
Publication status: Published - 1 Oct 2016
Externally published: Yes


ASJC Scopus subject areas

  • Artificial Intelligence

Cite this

Fernández, A. M., Esuli, A., & Sebastiani, F. (2016). Lightweight random indexing for polylingual text classification. Journal of Artificial Intelligence Research, 57, 151-185.

@article{2640377e87154378be32f8a171dfd834,
title = "Lightweight random indexing for polylingual text classification",
author = "Fern{\'a}ndez, {Alejandro Moreo} and Andrea Esuli and Fabrizio Sebastiani",
year = "2016",
month = "10",
day = "1",
language = "English",
volume = "57",
pages = "151--185",
journal = "Journal of Artificial Intelligence Research",
issn = "1076-9757",
publisher = "Morgan Kaufmann Publishers, Inc.",

}
