A statistical approach to crosslingual natural language tasks

David Pinto, Jorge Civera, Alberto Barron, Alfons Juan, Paolo Rosso

Research output: Contribution to journal › Article

32 Citations (Scopus)

Abstract

The huge volume of documents written in multiple languages on the Internet calls for novel algorithmic approaches to handling this kind of information. However, most crosslingual natural language processing (NLP) tasks take a decoupled approach in which monolingual NLP techniques are applied alongside an independent translation process. This two-step approach is overly sensitive to translation errors and, more generally, to the accumulation of errors across steps. To solve this problem, we propose a direct probabilistic crosslingual NLP system that integrates both steps, translation and the specific NLP task, into a single one. To implement this integrated approach to crosslingual tasks, we use the statistical IBM Model 1 word alignment model (M1). The M1 model can produce non-monotonic alignments between the words of a sentence in a source language and the words of a sentence in a different, target language, which is what language pairs with different word order require. In English, for instance, adjectives appear before nouns, whereas in Spanish the order is generally the opposite. The successful experimental results reported on three different tasks (text classification, information retrieval and plagiarism analysis) highlight the benefits of the integrated statistical approach proposed in this work.
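As a rough illustration of the word alignment model named in the abstract, the following is a minimal sketch of textbook IBM Model 1 training with expectation-maximisation. It is not the authors' implementation: the function names `train_ibm1` and `best_alignment`, the `<null>` token, and the three-sentence English-Spanish toy corpus are all assumptions made for this example. The table `t[(f, e)]` estimates the probability that source word `e` generates target word `f`, and word positions play no role, which is why the resulting alignment is free to be non-monotonic.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical empty source word for unaligned target words


def train_ibm1(pairs, iterations=10):
    """Estimate t[(f, e)] = p(target word f | source word e) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Uniform initialisation over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (f, e)
        total = defaultdict(float)  # expected counts for e
        for src, tgt in pairs:
            src = [NULL] + list(src)
            for f in tgt:
                # E-step: posterior over which source word generated f.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_alignment(t, src, tgt):
    """For each target word, return the index of its most likely source word.

    Index 0 is the NULL word; index i >= 1 is src[i - 1].
    """
    src = [NULL] + list(src)
    return [max(range(len(src)), key=lambda i: t[(f, src[i])]) for f in tgt]


# Invented toy corpus: English adjective-noun vs. Spanish noun-adjective order.
toy_pairs = [
    (["white", "house"], ["casa", "blanca"]),
    (["green", "house"], ["casa", "verde"]),
    (["white", "flower"], ["flor", "blanca"]),
]
table = train_ibm1(toy_pairs)
# "casa" should link to "house" (index 2) and "blanca" to "white" (index 1).
print(best_alignment(table, ["white", "house"], ["casa", "blanca"]))
```

On this toy corpus the alignment crosses: the first Spanish word links to the second English word and vice versa, which is the adjective-noun reordering the abstract describes.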

Original language: English
Pages (from-to): 51-60
Number of pages: 10
Journal: Journal of Algorithms
Volume: 64
Issue number: 1
DOIs: 10.1016/j.jalgor.2009.02.005
Publication status: Published - 2009
Externally published: Yes

Keywords

  • Crosslingual data
  • IBM translation models
  • Information retrieval
  • Natural language processing
  • Plagiarism analysis
  • Text classification

ASJC Scopus subject areas

  • Computational Mathematics
  • Control and Optimization
  • Computational Theory and Mathematics

Cite this

A statistical approach to crosslingual natural language tasks. / Pinto, David; Civera, Jorge; Barron, Alberto; Juan, Alfons; Rosso, Paolo.

In: Journal of Algorithms, Vol. 64, No. 1, 2009, p. 51-60.

@article{df6d15e0c88948a9a65afbc6f03c4a53,
title = "A statistical approach to crosslingual natural language tasks",
abstract = "The existence of huge volumes of documents written in multiple languages on Internet leads to investigate novel algorithmic approaches to deal with information of this kind. However, most crosslingual natural language processing (NLP) tasks consider a decoupled approach in which monolingual NLP techniques are applied along with an independent translation process. This two-step approach is too sensitive to translation errors, and in general to the accumulative effect of errors. To solve this problem, we propose to use a direct probabilistic crosslingual NLP system which integrates both steps, translation and the specific NLP task, into a single one. In order to perform this integrated approach to crosslingual tasks, we propose to use the statistical IBM 1 word alignment model (M1). The M1 model may show a non-monotonic behaviour when aligning words from a sentence in a source language to words from another sentence in a different, target language. This is the case of languages with different word order. In English, for instance, adjectives appear before nouns, whereas in Spanish it is exactly the opposite. The successful experimental results reported in three different tasks - text classification, information retrieval and plagiarism analysis - highlight the benefits of the statistical integrated approach proposed in this work.",
keywords = "Crosslingual data, IBM translation models, Information retrieval, Natural language processing, Plagiarism analysis, Text classification",
author = "David Pinto and Jorge Civera and Alberto Barron and Alfons Juan and Paolo Rosso",
year = "2009",
doi = "10.1016/j.jalgor.2009.02.005",
language = "English",
volume = "64",
pages = "51--60",
journal = "Journal of Algorithms",
issn = "0196-6774",
publisher = "Academic Press Inc.",
number = "1",
}

TY - JOUR
T1 - A statistical approach to crosslingual natural language tasks
AU - Pinto, David
AU - Civera, Jorge
AU - Barron, Alberto
AU - Juan, Alfons
AU - Rosso, Paolo
PY - 2009
Y1 - 2009
AB - The existence of huge volumes of documents written in multiple languages on Internet leads to investigate novel algorithmic approaches to deal with information of this kind. However, most crosslingual natural language processing (NLP) tasks consider a decoupled approach in which monolingual NLP techniques are applied along with an independent translation process. This two-step approach is too sensitive to translation errors, and in general to the accumulative effect of errors. To solve this problem, we propose to use a direct probabilistic crosslingual NLP system which integrates both steps, translation and the specific NLP task, into a single one. In order to perform this integrated approach to crosslingual tasks, we propose to use the statistical IBM 1 word alignment model (M1). The M1 model may show a non-monotonic behaviour when aligning words from a sentence in a source language to words from another sentence in a different, target language. This is the case of languages with different word order. In English, for instance, adjectives appear before nouns, whereas in Spanish it is exactly the opposite. The successful experimental results reported in three different tasks - text classification, information retrieval and plagiarism analysis - highlight the benefits of the statistical integrated approach proposed in this work.

KW - Crosslingual data
KW - IBM translation models
KW - Information retrieval
KW - Natural language processing
KW - Plagiarism analysis
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=84940363657&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84940363657&partnerID=8YFLogxK
U2 - 10.1016/j.jalgor.2009.02.005
DO - 10.1016/j.jalgor.2009.02.005
M3 - Article
AN - SCOPUS:84940363657
VL - 64
SP - 51
EP - 60
JO - Journal of Algorithms
JF - Journal of Algorithms
SN - 0196-6774
IS - 1
ER -