Extracting parallel phrases from comparable data

Sanjika Hewavitharana, Stephan Vogel

Research output: Chapter in Book/Report/Conference proceeding › Chapter

5 Citations (Scopus)

Abstract

Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these phrases creates a valuable resource, especially for low-resource languages. In this chapter we explore three phrase alignment approaches to detect parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path; a phrase extraction approach that does not rely on the Viterbi path of word alignments, but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the effectiveness of these approaches in detecting alignments for phrase pairs that have a known alignment in comparable sentence pairs. The results show that the non-Viterbi alignment approach outperforms the other two approaches in terms of F-measure.
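
As a rough illustration of the first approach and of the evaluation metric, the sketch below shows the alignment-consistency check commonly used in standard phrase extraction, together with a balanced F-measure over alignment links. It is a generic Python sketch, not the authors' implementation: the function names `target_span_for` and `f_measure`, the representation of a word alignment as a set of (source, target) index pairs, and the toy alignment are all assumptions made for illustration. The chapter itself may define its features and scoring units differently.

```python
from typing import Optional, Set, Tuple

# Hypothetical representation: a word alignment is a set of
# (source index, target index) links, e.g. taken from a Viterbi path.
Alignment = Set[Tuple[int, int]]


def target_span_for(
    alignment: Alignment, src_start: int, src_end: int
) -> Optional[Tuple[int, int]]:
    """Return the minimal target span consistent with a source span.

    A phrase pair is consistent when no word inside either span is
    linked to a word outside the other span. Returns None otherwise.
    """
    # Target positions linked to any source word inside the span.
    linked = [t for (s, t) in alignment if src_start <= s <= src_end]
    if not linked:
        return None  # nothing aligned, so no phrase pair to extract
    tgt_start, tgt_end = min(linked), max(linked)
    # Reject the pair if a target word in the span links outside the source span.
    for s, t in alignment:
        if tgt_start <= t <= tgt_end and not (src_start <= s <= src_end):
            return None
    return (tgt_start, tgt_end)


def f_measure(predicted: Alignment, reference: Alignment) -> float:
    """Balanced F-measure (harmonic mean of precision and recall) over links."""
    correct = len(predicted & reference)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy alignment (an assumption, not data from the chapter).
toy_alignment: Alignment = {(0, 0), (1, 2), (2, 1), (3, 3)}
consistent = target_span_for(toy_alignment, 1, 2)    # source words 1..2
inconsistent = target_span_for(toy_alignment, 0, 1)  # target word 1 links to source word 2
score = f_measure({(0, 0), (1, 1)}, {(0, 0), (1, 2)})
```

Because this check depends entirely on the links in a single alignment path, any error in that path changes which phrase pairs can be extracted. This is one motivation for comparing it against an approach that does not rely on the Viterbi path.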

Original language: English
Title of host publication: Building and Using Comparable Corpora
Publisher: Springer Berlin Heidelberg
Pages: 191-204
Number of pages: 14
ISBN (Print): 9783642201288, 9783642201271
DOIs: https://doi.org/10.1007/978-3-642-20128-8_10
Publication status: Published - 1 Jan 2013
Externally published: Yes

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Hewavitharana, S., & Vogel, S. (2013). Extracting parallel phrases from comparable data. In Building and Using Comparable Corpora (pp. 191-204). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_10

Extracting parallel phrases from comparable data. / Hewavitharana, Sanjika; Vogel, Stephan.

Building and Using Comparable Corpora. Springer Berlin Heidelberg, 2013. p. 191-204.

Hewavitharana, S & Vogel, S 2013, Extracting parallel phrases from comparable data. in Building and Using Comparable Corpora. Springer Berlin Heidelberg, pp. 191-204. https://doi.org/10.1007/978-3-642-20128-8_10
Hewavitharana S, Vogel S. Extracting parallel phrases from comparable data. In Building and Using Comparable Corpora. Springer Berlin Heidelberg. 2013. p. 191-204 https://doi.org/10.1007/978-3-642-20128-8_10
Hewavitharana, Sanjika ; Vogel, Stephan. / Extracting parallel phrases from comparable data. Building and Using Comparable Corpora. Springer Berlin Heidelberg, 2013. pp. 191-204
@inbook{03871f0f0d464724b1f5c2c69b2dd7a0,
title = "Extracting parallel phrases from comparable data",
abstract = "Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these phrases creates a valuable resource, especially for low-resource languages. In this chapter we explore three phrase alignment approaches to detect parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path; a phrase extraction approach that does not rely on the Viterbi path of word alignments, but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the effectiveness of these approaches in detecting alignments for phrase pairs that have a known alignment in comparable sentence pairs. The results show that the non-Viterbi alignment approach outperforms the other two approaches in terms of F-measure.",
author = "Sanjika Hewavitharana and Stephan Vogel",
year = "2013",
month = "1",
day = "1",
doi = "10.1007/978-3-642-20128-8_10",
language = "English",
isbn = "9783642201288",
pages = "191--204",
booktitle = "Building and Using Comparable Corpora",
publisher = "Springer Berlin Heidelberg",

}

TY - CHAP

T1 - Extracting parallel phrases from comparable data

AU - Hewavitharana, Sanjika

AU - Vogel, Stephan

PY - 2013/1/1

Y1 - 2013/1/1

N2 - Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these phrases creates a valuable resource, especially for low-resource languages. In this chapter we explore three phrase alignment approaches to detect parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path; a phrase extraction approach that does not rely on the Viterbi path of word alignments, but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the effectiveness of these approaches in detecting alignments for phrase pairs that have a known alignment in comparable sentence pairs. The results show that the non-Viterbi alignment approach outperforms the other two approaches in terms of F-measure.

UR - http://www.scopus.com/inward/record.url?scp=84956538657&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84956538657&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-20128-8_10

DO - 10.1007/978-3-642-20128-8_10

M3 - Chapter

AN - SCOPUS:84956538657

SN - 9783642201288

SN - 9783642201271

SP - 191

EP - 204

BT - Building and Using Comparable Corpora

PB - Springer Berlin Heidelberg

ER -