Diacritization as a machine translation problem and as a sequence labeling problem

Tim Schlippe, Thuy Linh Nguyen, Stephan Vogel

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

In this paper we describe and compare two techniques for the automatic diacritization of Arabic text. First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models: word-based and character-based models, used separately and in combination, as well as a model that uses statistical machine translation (SMT) to post-edit the output of a rule-based diacritization system. Second, we explore a more traditional view of diacritization as a sequence labeling problem and propose a solution using conditional random fields (Lafferty et al., 2001). All techniques are compared by word error rate and diacritization error rate, both for full diacritization and when vowel endings are ignored. In our experiments, the machine translation approaches achieved lower error rates than the sequence labeling approach.
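The sequence labeling view and the two evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `to_labels` and `error_rates` are hypothetical, and the metric definitions are assumptions: diacritization error rate is taken as the fraction of letters whose diacritics differ from the reference, word error rate as the fraction of words containing at least one such letter, and "ignoring vowel endings" as skipping the diacritics on the last letter of each word. Only the combining marks U+064B to U+0652 are treated as diacritics.

```python
"""Sketch: diacritization as per-letter labeling, scored by WER and DER."""

# Arabic combining marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def to_labels(word):
    """Split a diacritized word into (base_letter, diacritic_label) pairs."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:
            # Attach the mark to the preceding letter; sorting makes the
            # label independent of the order in which marks were typed.
            base, marks = pairs[-1]
            pairs[-1] = (base, "".join(sorted(marks + ch)))
    return pairs


def error_rates(reference, hypothesis, ignore_endings=False):
    """Return (WER, DER) for two diacritizations of the same undiacritized text."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same words")
    word_errors = letter_errors = letters = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = to_labels(ref_word), to_labels(hyp_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ between reference and hypothesis")
        if ignore_endings:
            # Assumption: the "vowel ending" is the label of the last letter.
            ref_pairs, hyp_pairs = ref_pairs[:-1], hyp_pairs[:-1]
        wrong = sum(r != h for (_, r), (_, h) in zip(ref_pairs, hyp_pairs))
        letter_errors += wrong
        letters += len(ref_pairs)
        word_errors += int(wrong > 0)
    wer = word_errors / len(ref_words)
    der = letter_errors / letters if letters else 0.0
    return wer, der


# Two words; the hypothesis gets only the final vowel of the first word wrong.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u062F\u064E\u0631\u0652\u0633\u064C"
hypothesis = "\u0643\u064E\u062A\u064E\u0628\u064F \u062F\u064E\u0631\u0652\u0633\u064C"
print(error_rates(reference, hypothesis))
print(error_rates(reference, hypothesis, ignore_endings=True))
```

In this toy pair the only mistake is on a word-final letter, so it counts against both metrics under full diacritization and disappears when endings are ignored. The per-letter labels returned by `to_labels` are the kind of target a sequence labeler such as a conditional random field would be trained to predict.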

Original language: English
Title of host publication: AMTA 2008 - 8th Conference of the Association for Machine Translation in the Americas
Publication status: Published - 1 Dec 2008
Externally published: Yes
Event: 8th Biennial Conference of the Association for Machine Translation in the Americas, AMTA 2008 - Waikiki, HI, United States
Duration: 21 Oct 2008 to 25 Oct 2008

Fingerprint

Labeling
Knowledge based systems
Translation Problems
Machine Translation
Experiments
Rule-based Systems
Statistical Machine Translation
Language Model

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Software

Cite this

Schlippe, T., Nguyen, T. L., & Vogel, S. (2008). Diacritization as a machine translation problem and as a sequence labeling problem. In AMTA 2008 - 8th Conference of the Association for Machine Translation in the Americas.

@inproceedings{181b43b8c4d747239ca5a66390c90038,
title = "Diacritization as a machine translation problem and as a sequence labeling problem",
author = "Tim Schlippe and Nguyen, {Thuy Linh} and Stephan Vogel",
year = "2008",
month = "12",
day = "1",
language = "English",
booktitle = "AMTA 2008 - 8th Conference of the Association for Machine Translation in the Americas",

}

TY - GEN

T1 - Diacritization as a machine translation problem and as a sequence labeling problem

AU - Schlippe, Tim

AU - Nguyen, Thuy Linh

AU - Vogel, Stephan

PY - 2008/12/1

Y1 - 2008/12/1

N2 - In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and combined as well as a model which uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem, and propose a solution using conditional random fields (Lafferty et al., 2001). All these techniques are compared through word error rate and diacritization error rate both in terms of full diacritization and ignoring vowel endings. The empirical experiments showed that the machine translation approaches perform better than the sequence labeling approaches concerning the error rates.

UR - http://www.scopus.com/inward/record.url?scp=84858027046&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84858027046&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84858027046

BT - AMTA 2008 - 8th Conference of the Association for Machine Translation in the Americas

ER -