Urdu word segmentation

Nadir Durrani, Sarmad Hussain

Research output: Chapter in Book/Report/Conference proceedingConference contribution

31 Citations (Scopus)

Abstract

Word Segmentation is the foremost obligatory task in almost all the NLP applications where the initial phase requires tokenization of input into words. Urdu is amongst the Asian languages that face word segmentation challenge. However, unlike other Asian languages, word segmentation in Urdu not only has space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features in Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs an n-gram ranking on top of rule based maximum matching heuristic. Our best technique gives an error detection of 85.8% and overall accuracy of 95.8%. Further issues and possible future directions are also discussed.

Original languageEnglish
Title of host publicationNAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
Pages528-536
Number of pages9
Publication statusPublished - 2010
Externally publishedYes
Event2010 Human Language Technologies Conference ofthe North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 - Los Angeles, CA, United States
Duration: 2 Jun 20104 Jun 2010

Other

Other2010 Human Language Technologies Conference ofthe North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010
CountryUnited States
CityLos Angeles, CA
Period2/6/104/6/10

    Fingerprint

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Durrani, N., & Hussain, S. (2010). Urdu word segmentation. In NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference (pp. 528-536)