Transliteration mining using large training and test sets

Ali El Kahki, Kareem Darwish, Ahmed Saad El Din, Mohamed Abd El-Wahab

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Much previous work on Transliteration Mining (TM) was conducted on short parallel snippets using limited training data, and successful methods tended to favor recall. For such methods, increasing the training data can degrade precision, and applying them to large comparable texts can degrade both precision and recall. We adapt a state-of-the-art TM technique with the best reported scores on the ACL 2010 NEWS workshop dataset, namely graph reinforcement, to work with large training sets. The method models observed character mappings between a language pair as a bipartite graph, and unseen mappings are induced using random walks. Increasing the training data yields more correct initial mappings, but the induced mappings become more error prone. We introduce a parameterized exponential penalty into the formulation of graph reinforcement and estimate the proper parameters for training sets of varying sizes. The new formulation led to sizable improvements in precision. Mining from large comparable texts leads to the presence of phonetically similar words in the source and target texts that may not be transliterations or that may adversely affect candidate ranking. To overcome this, we extracted related segments with high translation overlap and then performed TM on them. Segment extraction produced significantly higher precision for three different TM methods.
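To make the abstract's method concrete, the Python sketch below illustrates what graph reinforcement with a parameterized exponential penalty, and the translation-overlap segment filter, could look like. It is a minimal reading of the abstract, not the authors' implementation: the function names (graph_reinforcement, segment_overlap), the two-hop walk shape s -> t1 -> s1 -> t, the placement of the penalty exp(-lam) on each induced walk, and the noisy-or combination of walks are all illustrative assumptions.

from collections import defaultdict
import math

def graph_reinforcement(prob, lam=1.0):
    # prob: dict mapping (src_char, tgt_char) -> probability of a character
    # mapping observed in character-aligned transliteration training pairs.
    # lam: penalty parameter; larger values damp induced mappings harder,
    # trading recall for precision as the training set grows.
    tgt_of = defaultdict(list)  # src char -> [(tgt char, prob), ...]
    src_of = defaultdict(list)  # tgt char -> [(src char, prob), ...]
    for (s, t), p in prob.items():
        tgt_of[s].append((t, p))
        src_of[t].append((s, p))

    penalty = math.exp(-lam)  # parameterized exponential penalty (assumed form)
    updated = {}
    for s in tgt_of:
        # A random walk s -> t1 -> s1 -> t induces mappings (s, t) that
        # were never observed directly in the training data.
        for t1, p1 in tgt_of[s]:
            for s1, p2 in src_of[t1]:
                for t, p3 in tgt_of[s1]:
                    walk = penalty * p1 * p2 * p3
                    # Noisy-or: the mapping holds unless every supporting
                    # walk fails.
                    prev = updated.get((s, t), prob.get((s, t), 0.0))
                    updated[(s, t)] = 1.0 - (1.0 - prev) * (1.0 - walk)
    # Observed mappings are kept as a floor on the reinforced scores.
    for edge, p in prob.items():
        updated[edge] = max(updated.get(edge, 0.0), p)
    return updated

def segment_overlap(src_words, tgt_words, lexicon, threshold=0.5):
    # Hypothetical segment filter: keep a source/target segment pair only
    # if enough source words have a known translation (per `lexicon`, a
    # dict: src word -> set of tgt words) in the target segment. TM would
    # then run only on pairs that pass, which is one way to realize the
    # "high translation overlap" criterion the abstract describes.
    tgt = set(tgt_words)
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt)
    return hits / max(len(src_words), 1) >= threshold

# Toy usage: ("q", "K") was never observed, but the walk q -> Q -> k -> K
# induces a small, penalty-damped score for it.
seed = {("k", "K"): 0.8, ("k", "Q"): 0.2, ("q", "Q"): 0.9}
scores = graph_reinforcement(seed, lam=1.5)

On this reading, tuning lam per training-set size is the precision lever the abstract reports: a larger lam suppresses induced mappings more aggressively, which matters once a large training set makes the initial mapping table dense.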

Original language: English
Title of host publication: NAACL HLT 2012 - 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 243-252
Number of pages: 10
ISBN (Print): 1937284204, 9781937284206
Publication status: Published - 2012
Event: 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2012 - Montreal, Canada
Duration: 3 Jun 2012 - 8 Jun 2012

Other

Other: 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2012
Country: Canada
City: Montreal
Period: 3/6/12 - 8/6/12


ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Linguistics and Language

Cite this

Kahki, A. E., Darwish, K., Din, A. S. E., & El-Wahab, M. A. (2012). Transliteration mining using large training and test sets. In NAACL HLT 2012 - 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 243-252). Association for Computational Linguistics (ACL).
