An information-theoretic, vector-space-model approach to cross-language information retrieval

Peter A. Chew, Brett W. Bader, Stephen Helmreich, Ahmed Abdelali, Stephen J. Verzi

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a 'standard' approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
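The 'standard' LSA baseline that the article compares against can be sketched briefly. The following is an illustrative sketch only, not the authors' implementation: it applies log-entropy term weighting (a common information-theoretic scheme in LSA work) to a toy term-by-document matrix and projects documents into a low-rank latent space via truncated SVD. The toy data and all names are invented for illustration.

```python
import numpy as np

def log_entropy_weight(X):
    """Log-entropy weighting: local log(1 + tf) scaled by a global weight
    of 1 minus the normalized entropy of the term across documents."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[1]
    totals = X.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0            # avoid division by zero for unused terms
    p = X / totals                       # P(doc | term)
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)  # in [0, 1]
    return np.log1p(X) * global_w[:, None]

def lsa_doc_vectors(X, k=2):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T   # one row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy term-by-document matrix: docs 0-1 and docs 2-3 share vocabulary.
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
])

docs = lsa_doc_vectors(log_entropy_weight(X), k=2)
# Documents that share vocabulary end up close together in the latent space.
```

The global weight rewards terms concentrated in few documents (low entropy) and downweights terms spread evenly across the collection, which is the information-theoretic intuition the abstract builds on.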

Original language: English
Pages (from-to): 37-70
Number of pages: 34
Journal: Natural Language Engineering
Volume: 17
Issue number: 1
DOI: 10.1017/S1351324910000185
Publication status: Published - 1 Jan 2011
Externally published: Yes

Fingerprint

  • Query languages
  • Computational linguistics
  • Vector spaces
  • Information retrieval
  • Information theory
  • Language
  • Decomposition
  • Weighting
  • Transparency
  • Tensors
  • Semantics
  • Cross-language
  • Vector Space Model
  • Processing

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Cite this

An information-theoretic, vector-space-model approach to cross-language information retrieval. / Chew, Peter A.; Bader, Brett W.; Helmreich, Stephen; Abdelali, Ahmed; Verzi, Stephen J.

In: Natural Language Engineering, Vol. 17, No. 1, 01.01.2011, p. 37-70.

Research output: Contribution to journal › Article

Chew, Peter A.; Bader, Brett W.; Helmreich, Stephen; Abdelali, Ahmed; Verzi, Stephen J. / An information-theoretic, vector-space-model approach to cross-language information retrieval. In: Natural Language Engineering. 2011; Vol. 17, No. 1, pp. 37-70.
@article{319d876195e342439a81fae28f543e8b,
title = "An information-theoretic, vector-space-model approach to cross-language information retrieval",
author = "Chew, {Peter A.} and Bader, {Brett W.} and Stephen Helmreich and Ahmed Abdelali and Verzi, {Stephen J.}",
year = "2011",
month = "1",
day = "1",
doi = "10.1017/S1351324910000185",
language = "English",
volume = "17",
pages = "37--70",
journal = "Natural Language Engineering",
issn = "1351-3249",
publisher = "Cambridge University Press",
number = "1",
}

TY - JOUR

T1 - An information-theoretic, vector-space-model approach to cross-language information retrieval

AU - Chew, Peter A.

AU - Bader, Brett W.

AU - Helmreich, Stephen

AU - Abdelali, Ahmed

AU - Verzi, Stephen J.

PY - 2011/1/1

Y1 - 2011/1/1

UR - http://www.scopus.com/inward/record.url?scp=79957483607&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79957483607&partnerID=8YFLogxK

U2 - 10.1017/S1351324910000185

DO - 10.1017/S1351324910000185

M3 - Article

VL - 17

SP - 37

EP - 70

JO - Natural Language Engineering

JF - Natural Language Engineering

SN - 1351-3249

IS - 1

ER -