Tagging Urdu text with parts of speech

A tagger comparison

Hassan Sajjad, Helmut Schmid

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

13 Citations (Scopus)

Abstract

In this paper, four state-of-the-art probabilistic taggers, namely the TnT tagger, TreeTagger, RF tagger and SVM tool, are applied to the Urdu language. For the purpose of the experiment, a syntactic tagset is proposed. A training corpus of 100,000 tokens is used to train the models. When the lexicon is extracted from the training corpus alone, SVM tool achieves the best accuracy, 94.15%. When a separate lexicon of 70,568 types is provided, SVM tool again achieves the best accuracy, 95.66%.
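To put the two reported accuracies in perspective, the following is a minimal sketch of how token-level tagging accuracy is commonly computed and how the gain from 94.15% to 95.66% translates into a relative error reduction. The function names and the toy tag sequences are hypothetical illustrations; they are not taken from the paper, its tagset, or any of the four taggers.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return token-level accuracy in percent for two aligned tag sequences."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


def relative_error_reduction(baseline_acc, improved_acc):
    """Return the percentage of baseline errors removed by the improved system."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Toy example with made-up tags (not the tagset proposed in the paper).
gold = ["NN", "VB", "PN", "NN"]
pred = ["NN", "VB", "NN", "NN"]
toy_accuracy = tagging_accuracy(gold, pred)

# The two best accuracies reported in the abstract.
absolute_gain = 95.66 - 94.15
reduction = relative_error_reduction(94.15, 95.66)
```

Under these definitions, the separate lexicon yields an absolute gain of about 1.51 percentage points, which corresponds to removing roughly a quarter of the remaining tagging errors.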

Original language: English
Title of host publication: EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings
Pages: 692-700
Number of pages: 9
Publication status: Published - 1 Dec 2009
Externally published: Yes
Event: 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2009 - Athens, Greece
Duration: 30 Mar 2009 to 3 Apr 2009

Fingerprint

Urdu
Part of Speech
Tag
Tagging
Lexicon
Syntax
Language
Train
Experiment

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Sajjad, H., & Schmid, H. (2009). Tagging Urdu text with parts of speech: A tagger comparison. In EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings (pp. 692-700)

@inproceedings{61f4604053a14e68bed405b8634c0bba,
title = "Tagging Urdu text with parts of speech: A tagger comparison",
author = "Hassan Sajjad and Helmut Schmid",
year = "2009",
month = "12",
day = "1",
language = "English",
isbn = "9781932432169",
pages = "692--700",
booktitle = "EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings",

}

TY - GEN

T1 - Tagging Urdu text with parts of speech

T2 - A tagger comparison

AU - Sajjad, Hassan

AU - Schmid, Helmut

PY - 2009/12/1

Y1 - 2009/12/1

UR - http://www.scopus.com/inward/record.url?scp=77952119717&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77952119717&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781932432169

SP - 692

EP - 700

BT - EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings

ER -