Diacritization as a machine translation problem and as a sequence labeling problem

Tim Schlippe, Thuy Linh Nguyen, Stephan Vogel

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

12 Citations (Scopus)

Abstract

In this paper we describe and compare two techniques for the automatic diacritization of Arabic text. First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word-based and character-based models, separately and combined, as well as a model that uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem and propose a solution using conditional random fields (Lafferty et al., 2001). All techniques are compared by word error rate and diacritization error rate, both for full diacritization and when ignoring vowel endings. In our experiments, the machine translation approaches achieved lower error rates than the sequence labeling approaches.
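
The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the definitions, so the ones used here are assumptions: diacritization error rate is taken as the share of letters whose attached diacritics differ from the reference, word error rate as the share of words containing at least one such letter, and "ignoring vowel endings" as skipping the last letter of every word. The function names `split_letters` and `error_rates` are hypothetical, and this is not the authors' scoring code.

```python
# Arabic diacritic marks (fathatan through sukun), U+064B to U+0652.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_letters(word):
    """Return (letter, marks) pairs; each mark attaches to the preceding letter."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # a mark with no preceding letter is dropped
                letter, marks = pairs[-1]
                pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis, ignore_endings=False):
    """Return (word error rate, diacritization error rate) under the assumed definitions."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same number of words")

    letters = letter_errors = word_errors = 0
    for ref, hyp in zip(ref_words, hyp_words):
        r, h = split_letters(ref), split_letters(hyp)
        if [l for l, _ in r] != [l for l, _ in h]:
            raise ValueError("undiacritized letters must match")
        if ignore_endings:
            # Assumption: the case ending is carried by the last letter of the word.
            r, h = r[:-1], h[:-1]
        # Sort the marks so that shadda+vowel and vowel+shadda compare equal.
        wrong = sum(sorted(rm) != sorted(hm) for (_, rm), (_, hm) in zip(r, h))
        letters += len(r)
        letter_errors += wrong
        word_errors += wrong > 0

    der = letter_errors / letters if letters else 0.0
    wer = word_errors / len(ref_words) if ref_words else 0.0
    return wer, der


# Reference "kataba" versus a hypothesis with a wrong final vowel ("katabu").
ref = "\u0643\u064e\u062a\u064e\u0628\u064e"
hyp = "\u0643\u064e\u062a\u064e\u0628\u064f"
print(error_rates(ref, hyp))
print(error_rates(ref, hyp, ignore_endings=True))
```

In this toy case the only error is on the final letter, so it counts against both measures under full diacritization and disappears when endings are ignored.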

Original language: English
Title of host publication: AMTA 2008 - 8th Conference of the Association for Machine Translation in the Americas
Publication status: Published - 1 Dec 2008
Externally published: Yes
Event: 8th Biennial Conference of the Association for Machine Translation in the Americas, AMTA 2008 - Waikiki, HI, United States
Duration: 21 Oct 2008 - 25 Oct 2008

Other

Other: 8th Biennial Conference of the Association for Machine Translation in the Americas, AMTA 2008
Country: United States
City: Waikiki, HI
Period: 21/10/08 - 25/10/08

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Software

Cite this

Schlippe, T., Nguyen, T. L., & Vogel, S. (2008). Diacritization as a machine translation problem and as a sequence labeling problem. In AMTA 2008 - 8th Conference of the Association for Machine Translation in the Americas