Urdu word segmentation

Nadir Durrani, Sarmad Hussain

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

29 Citations (Scopus)

Abstract

Word segmentation is the foremost obligatory task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Urdu is among the Asian languages that face a word segmentation challenge. However, unlike in other Asian languages, word segmentation in Urdu involves not only space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features of Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs n-gram ranking on top of a rule-based maximum matching heuristic. Our best technique gives an error detection of 85.8% and an overall accuracy of 95.8%. Further issues and possible future directions are also discussed.
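The hybrid idea named in the abstract (n-gram ranking layered on a maximum matching heuristic) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one common formulation of maximum matching (keep the lexicon-based segmentations with the fewest words), uses an add-one-smoothed bigram score for the ranking step, and substitutes a toy Latin-script lexicon and invented counts for Urdu data. All function names, the lexicon, and the counts are hypothetical. It covers only the space omission case, not space insertion.

```python
from functools import lru_cache
from math import log

# Toy lexicon and counts: illustrative assumptions, not data from the paper.
LEXICON = {"the", "table", "down", "there", "theta", "bled", "own", "tab", "led"}
BIGRAM_COUNTS = {("<s>", "the"): 5, ("the", "table"): 3,
                 ("table", "down"): 2, ("down", "there"): 2}
UNIGRAM_COUNTS = {"<s>": 5, "the": 6, "table": 3, "down": 2}


def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    @lru_cache(maxsize=None)
    def split(start):
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                results.extend([word] + rest for rest in split(end))
        return results
    return split(0)


def max_match_candidates(text, lexicon):
    """Maximum matching step: keep the segmentations with the fewest words."""
    candidates = segmentations(text, lexicon)
    if not candidates:
        return []
    fewest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == fewest]


def bigram_score(words, bigram_counts, unigram_counts, vocab_size):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    score, prev = 0.0, "<s>"
    for word in words:
        numerator = bigram_counts.get((prev, word), 0) + 1
        denominator = unigram_counts.get(prev, 0) + vocab_size
        score += log(numerator / denominator)
        prev = word
    return score


def segment(text, lexicon, bigram_counts, unigram_counts):
    """Rank the maximum matching candidates by bigram score; None if no split exists."""
    candidates = max_match_candidates(text, lexicon)
    if not candidates:
        return None
    vocab_size = len(lexicon) + 1  # +1 for the start symbol
    return max(candidates,
               key=lambda c: bigram_score(c, bigram_counts, unigram_counts, vocab_size))


print(segment("thetabledownthere", LEXICON, BIGRAM_COUNTS, UNIGRAM_COUNTS))
```

In this toy input, maximum matching alone leaves two four-word candidates ("the table down there" and "theta bled own there"); the bigram score is what separates them, which is the motivation for ranking on top of the heuristic.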

Original language: English
Title of host publication: NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
Pages: 528-536
Number of pages: 9
Publication status: Published - 2010
Externally published: Yes
Event: 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 - Los Angeles, CA, United States
Duration: 2 Jun 2010 to 4 Jun 2010

Other

Other: 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010
Country: United States
City: Los Angeles, CA
Period: 2/6/10 to 4/6/10

Fingerprint

language
ranking
heuristics
linguistics
segmentation
Word Segmentation
Urdu
Asian Languages
Natural Language Processing
Linguistic Features
N-gram
Error Detection
Omission
Insertion
Orthographic
Heuristics
Ranking
Trigger

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Durrani, N., & Hussain, S. (2010). Urdu word segmentation. In NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference (pp. 528-536)

@inproceedings{397f9e7ac7804887823447369f068227,
title = "Urdu word segmentation",
abstract = "Word Segmentation is the foremost obligatory task in almost all the NLP applications where the initial phase requires tokenization of input into words. Urdu is amongst the Asian languages that face word segmentation challenge. However, unlike other Asian languages, word segmentation in Urdu not only has space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features in Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs an n-gram ranking on top of rule based maximum matching heuristic. Our best technique gives an error detection of 85.8{\%} and overall accuracy of 95.8{\%}. Further issues and possible future directions are also discussed.",
author = "Nadir Durrani and Sarmad Hussain",
year = "2010",
language = "English",
isbn = "1932432655",
pages = "528--536",
booktitle = "NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference",

}

TY - GEN

T1 - Urdu word segmentation

AU - Durrani, Nadir

AU - Hussain, Sarmad

PY - 2010

Y1 - 2010

N2 - Word Segmentation is the foremost obligatory task in almost all the NLP applications where the initial phase requires tokenization of input into words. Urdu is amongst the Asian languages that face word segmentation challenge. However, unlike other Asian languages, word segmentation in Urdu not only has space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features in Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs an n-gram ranking on top of rule based maximum matching heuristic. Our best technique gives an error detection of 85.8% and overall accuracy of 95.8%. Further issues and possible future directions are also discussed.

UR - http://www.scopus.com/inward/record.url?scp=79952062201&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952062201&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:79952062201

SN - 1932432655

SN - 9781932432657

SP - 528

EP - 536

BT - NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference

ER -