Latent Morpho-Semantic Analysis

Multilingual information retrieval with character n-grams and mutual information

Peter A. Chew, Brett W. Bader, Ahmed Abdelali

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

4 Citations (Scopus)

Abstract

We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams.

Original language: English
Title of host publication: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 129-136
Number of pages: 8
Volume: 1
Publication status: Published - 1 Dec 2008
Externally published: Yes
Event: 22nd International Conference on Computational Linguistics, Coling 2008 - Manchester, United Kingdom
Duration: 18 Aug 2008 to 22 Aug 2008

Fingerprint

Information retrieval
Semantics
Latent Semantic Analysis
N-gram
Semantic Analysis
Mutual Information
language
appeal
statistics

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language

Cite this

Chew, P. A., Bader, B. W., & Abdelali, A. (2008). Latent Morpho-Semantic Analysis: Multilingual information retrieval with character n-grams and mutual information. In Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference (Vol. 1, pp. 129-136)

Latent Morpho-Semantic Analysis: Multilingual information retrieval with character n-grams and mutual information. / Chew, Peter A.; Bader, Brett W.; Abdelali, Ahmed.

Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference. Vol. 1 2008. p. 129-136.

Chew, PA, Bader, BW & Abdelali, A 2008, Latent Morpho-Semantic Analysis: Multilingual information retrieval with character n-grams and mutual information. in Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference. vol. 1, pp. 129-136, 22nd International Conference on Computational Linguistics, Coling 2008, Manchester, United Kingdom, 18/8/08.

Chew PA, Bader BW, Abdelali A. Latent Morpho-Semantic Analysis: Multilingual information retrieval with character n-grams and mutual information. In Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference. Vol. 1. 2008. p. 129-136.

Chew, Peter A.; Bader, Brett W.; Abdelali, Ahmed. / Latent Morpho-Semantic Analysis: Multilingual information retrieval with character n-grams and mutual information. Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference. Vol. 1 2008. pp. 129-136.

@inproceedings{ff5d030ee7bb43389e9819d4b5511e07,
title = "Latent Morpho-Semantic Analysis: Multilingual information retrieval with character n-grams and mutual information",
abstract = "We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams.",
author = "Chew, {Peter A.} and Bader, {Brett W.} and Abdelali, Ahmed",
year = "2008",
month = "12",
day = "1",
language = "English",
isbn = "9781905593446",
volume = "1",
pages = "129--136",
booktitle = "Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference",

}

TY - GEN

T1 - Latent Morpho-Semantic Analysis

T2 - Multilingual information retrieval with character n-grams and mutual information

AU - Chew, Peter A.

AU - Bader, Brett W.

AU - Abdelali, Ahmed

PY - 2008/12/1

Y1 - 2008/12/1

N2 - We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams.

UR - http://www.scopus.com/inward/record.url?scp=80053388503&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053388503&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781905593446

VL - 1

SP - 129

EP - 136

BT - Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference

ER -