Extracting parallel phrases from comparable data for machine translation

Sanjika Hewavitharana, Stephan Vogel

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to align only the parallel sections, bypassing non-parallel sections of the sentence. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two approaches. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic-English and Urdu-English translation systems, which resulted in improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.
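
For readers unfamiliar with the standard phrase extraction baseline mentioned in the abstract, the following is a minimal sketch of the usual consistency heuristic applied to a word alignment. It is an illustration under stated assumptions, not the authors' implementation: the function name `extract_phrase_pairs`, the `max_len` limit, and the toy sentences are all hypothetical, and the sketch omits the common extension of phrase boundaries over unaligned words.

```python
from typing import List, Set, Tuple

# Word alignment as a set of (source_index, target_index) links.
Alignment = Set[Tuple[int, int]]


def extract_phrase_pairs(
    src: List[str], tgt: List[str], alignment: Alignment, max_len: int = 4
) -> List[Tuple[str, str]]:
    """Return phrase pairs that are consistent with the word alignment.

    A pair is consistent when no word inside the target span is linked
    to a source word outside the source span (and vice versa by construction).
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # skip spans with no alignment links
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject the span if a target word in it links outside the source span.
            if any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            ):
                continue
            pairs.append(
                (" ".join(src[s_start:s_end + 1]), " ".join(tgt[t_start:t_end + 1]))
            )
    return pairs


if __name__ == "__main__":
    # Hypothetical toy example with a one-to-one alignment.
    source = ["das", "haus", "ist", "klein"]
    target = ["the", "house", "is", "small"]
    links = {(0, 0), (1, 1), (2, 2), (3, 3)}
    for pair in extract_phrase_pairs(source, target, links, max_len=2):
        print(pair)
```

Because this heuristic depends on links covering the whole sentence pair, it serves here only to illustrate the baseline that the abstract contrasts with an approach aligning just the parallel sections.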

Original language: English
Pages (from-to): 549-573
Number of pages: 25
Journal: Natural Language Engineering
Volume: 22
Issue number: 4
DOI: 10.1017/S1351324916000139
Publication status: Published - 1 Jul 2016

Fingerprint

candidacy
Classifiers
language
Machine Translation
Processing
Alignment
English Translation
Comparable Corpora
Natural Language Processing
Urdu
Statistical Machine Translation
Classifier
Translation System
Fold

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Language and Linguistics
  • Linguistics and Language

Cite this

Extracting parallel phrases from comparable data for machine translation. / Hewavitharana, Sanjika; Vogel, Stephan.

In: Natural Language Engineering, Vol. 22, No. 4, 01.07.2016, p. 549-573.

@article{78241324f60e4f54a878ea2e9e286509,
title = "Extracting parallel phrases from comparable data for machine translation",
abstract = "Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to align only the parallel sections, bypassing non-parallel sections of the sentence. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two approaches. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic-English and Urdu-English translation systems, which resulted in improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.",
author = "Sanjika Hewavitharana and Stephan Vogel",
year = "2016",
month = "7",
day = "1",
doi = "10.1017/S1351324916000139",
language = "English",
volume = "22",
pages = "549--573",
journal = "Natural Language Engineering",
issn = "1351-3249",
publisher = "Cambridge University Press",
number = "4",

}

TY - JOUR

T1 - Extracting parallel phrases from comparable data for machine translation

AU - Hewavitharana, Sanjika

AU - Vogel, Stephan

PY - 2016/7/1

Y1 - 2016/7/1

N2 - Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to align only the parallel sections, bypassing non-parallel sections of the sentence. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two approaches. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic-English and Urdu-English translation systems, which resulted in improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.

UR - http://www.scopus.com/inward/record.url?scp=84975259922&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84975259922&partnerID=8YFLogxK

U2 - 10.1017/S1351324916000139

DO - 10.1017/S1351324916000139

M3 - Article

AN - SCOPUS:84975259922

VL - 22

SP - 549

EP - 573

JO - Natural Language Engineering

JF - Natural Language Engineering

SN - 1351-3249

IS - 4

ER -