Statistical models for unsupervised, semi-supervised, and supervised transliteration mining

Hassan Sajjad, Helmut Schmid, Alexander Fraser, Hinrich Schütze

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration submodel learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
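The abstract outlines a mixture model: an interpolation of a learned transliteration submodel with a fixed noise (non-transliteration) model, trained by EM on unlabeled word pairs, after which each pair is labeled by the posterior probability of the transliteration submodel. The sketch below illustrates that idea under deliberate simplifications that are not the paper's actual submodels: a naive position-wise character alignment stands in for the paper's joint character-sequence transliteration model, and a character-unigram product serves as the fixed noise model.

```python
from collections import defaultdict

FLOOR = 1e-6  # smoothing floor for unseen characters / character pairs


def char_unigrams(words):
    """Fixed character-unigram distribution for one side of the noise model."""
    counts = defaultdict(int)
    for w in words:
        for c in w:
            counts[c] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def aligned_chars(s, t):
    """Toy 1-1 alignment: pair characters by position, pad with '#'."""
    n = max(len(s), len(t))
    return [(s[i] if i < len(s) else "#",
             t[i] if i < len(t) else "#") for i in range(n)]


def p_noise(pair, src_uni, tgt_uni):
    """Noise model: source and target words generated independently."""
    s, t = pair
    p = 1.0
    for c in s:
        p *= src_uni.get(c, FLOOR)
    for c in t:
        p *= tgt_uni.get(c, FLOOR)
    return p


def p_translit(pair, edge_prob):
    """Toy transliteration submodel: product of aligned character-pair probs."""
    p = 1.0
    for edge in aligned_chars(*pair):
        p *= edge_prob.get(edge, FLOOR)
    return p


def em_mine(pairs, iterations=20):
    """EM over the two-component mixture; the noise model stays fixed,
    the transliteration submodel and the interpolation weight are re-estimated."""
    src_uni = char_unigrams([s for s, _ in pairs])
    tgt_uni = char_unigrams([t for _, t in pairs])
    edge_prob = {}   # transliteration parameters, re-estimated each iteration
    lam = 0.5        # prior probability of the transliteration component
    posteriors = [0.0] * len(pairs)
    for _ in range(iterations):
        counts = defaultdict(float)
        # E-step: posterior that each pair was generated as a transliteration
        for i, pair in enumerate(pairs):
            pt = lam * p_translit(pair, edge_prob)
            pn = (1.0 - lam) * p_noise(pair, src_uni, tgt_uni)
            g = pt / (pt + pn)
            posteriors[i] = g
            for edge in aligned_chars(*pair):
                counts[edge] += g
        # M-step: re-estimate transliteration parameters and the mixture weight
        total = sum(counts.values())
        if total > 0:
            edge_prob = {e: c / total for e, c in counts.items()}
        lam = sum(posteriors) / len(posteriors)
    return posteriors


# Hypothetical toy data: identity pairs as stand-in transliterations,
# unrelated German translations as noise.
pairs = [("anna", "anna"), ("otto", "otto"), ("nina", "nina"),
         ("dog", "tisch"), ("cat", "haus")]
posteriors = em_mine(pairs)
mined = [p for p, g in zip(pairs, posteriors) if g > 0.5]
```

Thresholding the posterior at 0.5 amounts to picking the more probable submodel for each pair, which is the disambiguation step the abstract describes; the semi-supervised and supervised settings of the paper additionally exploit labeled data when estimating the model.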

Original language: English
Pages (from-to): 350-375
Number of pages: 26
Journal: Computational Linguistics
Volume: 43
Issue number: 2
DOIs: 10.1162/COLI_a_00286
Publication status: Published - 1 Jun 2017

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Computer Science Applications
  • Artificial Intelligence

Cite this

Statistical models for unsupervised, semi-supervised, and supervised transliteration mining. / Sajjad, Hassan; Schmid, Helmut; Fraser, Alexander; Schütze, Hinrich.

In: Computational Linguistics, Vol. 43, No. 2, 01.06.2017, p. 350-375.

Sajjad, Hassan; Schmid, Helmut; Fraser, Alexander; Schütze, Hinrich. / Statistical models for unsupervised, semi-supervised, and supervised transliteration mining. In: Computational Linguistics. 2017; Vol. 43, No. 2. pp. 350-375.
@article{fb37a065a4d1474fa3ab3b6438df8158,
title = "Statistical models for unsupervised, semi-supervised, and supervised transliteration mining",
abstract = "We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration submodel learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2{\%} transliteration pairs, our system achieves up to 86.7{\%} F-measure with 77.9{\%} precision and 97.8{\%} recall.",
author = "Hassan Sajjad and Helmut Schmid and Alexander Fraser and Hinrich Sch{\"u}tze",
year = "2017",
month = "6",
day = "1",
doi = "10.1162/COLI_a_00286",
language = "English",
volume = "43",
pages = "350--375",
journal = "Computational Linguistics",
issn = "0891-2017",
publisher = "MIT Press Journals",
number = "2",
}

TY - JOUR

T1 - Statistical models for unsupervised, semi-supervised, and supervised transliteration mining

AU - Sajjad, Hassan

AU - Schmid, Helmut

AU - Fraser, Alexander

AU - Schütze, Hinrich

PY - 2017/6/1

Y1 - 2017/6/1

N2 - We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration submodel learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.

AB - We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration submodel learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.

UR - http://www.scopus.com/inward/record.url?scp=85021796432&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85021796432&partnerID=8YFLogxK

U2 - 10.1162/COLI_a_00286

DO - 10.1162/COLI_a_00286

M3 - Article

VL - 43

SP - 350

EP - 375

JO - Computational Linguistics

JF - Computational Linguistics

SN - 0891-2017

IS - 2

ER -