Improved transliteration mining using graph reinforcement

Ali El-Kahky, Kareem Darwish, Ahmed Saad Aldein, Mohamed Abd El-Wahab, Ahmed Hefny, Waleed Ammar

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

12 Citations (Scopus)

Abstract

Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross-language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at the character-sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in four alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs, respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs, respectively.
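
The abstract gives no formulas, so the following is only a rough sketch of what one round of reinforcement over a bipartite mapping graph might look like. It assumes a noisy-or combination of evidence along length-3 paths (source, intermediate target, intermediate source, target); that scoring rule, the function name `reinforce`, and the dictionary layout are illustrative assumptions, not the authors' stated method. The link reweighting step mentioned in the abstract is not modeled here.

```python
from collections import defaultdict


def reinforce(weights):
    """One hypothetical reinforcement round on a bipartite mapping graph.

    weights maps (source_seq, target_seq) -> weight in [0, 1].
    Returns a new dict that may contain links absent from the input.
    The noisy-or scoring below is an assumption, not taken from the paper.
    """
    by_source = defaultdict(dict)
    by_target = defaultdict(dict)
    for (s, t), w in weights.items():
        by_source[s][t] = w
        by_target[t][s] = w

    # miss[(s, t)] is the probability that no piece of evidence supports
    # the link; the direct link, if present, counts as one such piece.
    miss = defaultdict(lambda: 1.0)
    for (s, t), w in weights.items():
        miss[(s, t)] = 1.0 - w

    # Walk every path s -> t_mid -> s_mid -> t that leaves s and t_mid.
    for s, targets in by_source.items():
        for t_mid, w1 in targets.items():
            for s_mid, w2 in by_target[t_mid].items():
                if s_mid == s:
                    continue
                for t, w3 in by_source[s_mid].items():
                    if t == t_mid:
                        continue
                    miss[(s, t)] *= 1.0 - w1 * w2 * w3

    return {link: 1.0 - m for link, m in miss.items()}


# Toy graph with placeholder sequences: "f" and "ph" share target "T1",
# so the round infers a new link from "f" to "T2".
initial = {("f", "T1"): 0.9, ("ph", "T1"): 0.5, ("ph", "T2"): 0.4}
enriched = reinforce(initial)
print(sorted(enriched))
```

In this toy graph, the only path that reaches the inferred link runs through the shared target, so its weight is the product of the three link weights on that path; existing links with no alternative paths keep their original weight.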

Original language: English
Title of host publication: EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 1384-1393
Number of pages: 10
Publication status: Published - 3 Oct 2011
Event: Conference on Empirical Methods in Natural Language Processing, EMNLP 2011 - Edinburgh, United Kingdom
Duration: 27 Jul 2011 to 31 Jul 2011

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

El-Kahky, A., Darwish, K., Aldein, A. S., El-Wahab, M. A., Hefny, A., & Ammar, W. (2011). Improved transliteration mining using graph reinforcement. In EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 1384-1393)

Improved transliteration mining using graph reinforcement. / El-Kahky, Ali; Darwish, Kareem; Aldein, Ahmed Saad; El-Wahab, Mohamed Abd; Hefny, Ahmed; Ammar, Waleed.

EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2011. p. 1384-1393.

El-Kahky, A, Darwish, K, Aldein, AS, El-Wahab, MA, Hefny, A & Ammar, W 2011, Improved transliteration mining using graph reinforcement. in EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. pp. 1384-1393, Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, United Kingdom, 27/7/11.
El-Kahky A, Darwish K, Aldein AS, El-Wahab MA, Hefny A, Ammar W. Improved transliteration mining using graph reinforcement. In EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2011. p. 1384-1393
El-Kahky, Ali ; Darwish, Kareem ; Aldein, Ahmed Saad ; El-Wahab, Mohamed Abd ; Hefny, Ahmed ; Ammar, Waleed. / Improved transliteration mining using graph reinforcement. EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2011. pp. 1384-1393
@inproceedings{33900f00695a41c5b93f34debc82d04a,
title = "Improved transliteration mining using graph reinforcement",
abstract = "Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross-language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at the character-sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in four alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs, respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs, respectively.",
author = "Ali El-Kahky and Kareem Darwish and Aldein, {Ahmed Saad} and El-Wahab, {Mohamed Abd} and Ahmed Hefny and Waleed Ammar",
year = "2011",
month = "10",
day = "3",
language = "English",
isbn = "1937284115",
pages = "1384--1393",
booktitle = "EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference",

}

TY - GEN

T1 - Improved transliteration mining using graph reinforcement

AU - El-Kahky, Ali

AU - Darwish, Kareem

AU - Aldein, Ahmed Saad

AU - El-Wahab, Mohamed Abd

AU - Hefny, Ahmed

AU - Ammar, Waleed

PY - 2011/10/3

Y1 - 2011/10/3

AB - Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross-language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at the character-sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in four alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs, respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs, respectively.

UR - http://www.scopus.com/inward/record.url?scp=80053263297&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053263297&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:80053263297

SN - 1937284115

SN - 9781937284114

SP - 1384

EP - 1393

BT - EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

ER -