Extracting parallel phrases from comparable data for machine translation

Sanjika Hewavitharana, Stephan Vogel

Research output: Contribution to journal (Article)

5 Citations (Scopus)


Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to align only the parallel sections of a sentence pair, bypassing the non-parallel sections. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic-English and Urdu-English translation systems, which resulted in improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.

Original language: English
Pages (from-to): 549-573
Number of pages: 25
Journal: Natural Language Engineering
Issue number: 4
Publication status: Published - 1 Jul 2016


ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Language and Linguistics
  • Linguistics and Language
