Tagging Urdu text with parts of speech: A tagger comparison

Hassan Sajjad, Helmut Schmid

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Citations (Scopus)

Abstract

In this paper, four state-of-the-art probabilistic taggers, namely the TnT tagger, TreeTagger, the RF tagger and the SVM tool, are applied to the Urdu language. For the purpose of the experiment, a syntactic tagset is proposed. A training corpus of 100,000 tokens is used to train the models. Using the lexicon extracted from the training corpus, the SVM tool shows the best accuracy of 94.15%. After providing a separate lexicon of 70,568 types, the SVM tool again shows the best accuracy of 95.66%.
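
The accuracy figures in the abstract are presumably token-level: the share of tokens whose predicted tag matches the gold-standard tag. The abstract does not define the metric, so this reading is an assumption. A minimal sketch of that computation follows; the function name `tagging_accuracy`, the toy sentences and the tag labels are all hypothetical and are not taken from the paper's tagset or data.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the percentage of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    # Count positions where the tagger agrees with the gold annotation.
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


# Hypothetical toy data: five tokens, one of them mistagged.
gold = ["NN", "VB", "ADJ", "NN", "PM"]
pred = ["NN", "VB", "NN", "NN", "PM"]

accuracy = tagging_accuracy(gold, pred)
print(f"{accuracy:.2f}%")  # prints "80.00%"
```

Under this reading, an accuracy of 94.15% means that roughly 94 of every 100 test tokens received the gold tag.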

Original language: English
Title of host publication: EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings
Pages: 692-700
Number of pages: 9
Publication status: Published - 1 Dec 2009
Externally published: Yes
Event: 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2009 - Athens, Greece
Duration: 30 Mar 2009 - 3 Apr 2009

Other

Other: 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2009
Country: Greece
City: Athens
Period: 30/3/09 - 3/4/09

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Sajjad, H., & Schmid, H. (2009). Tagging Urdu text with parts of speech: A tagger comparison. In EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings (pp. 692-700).