Statistical morphological disambiguation for agglutinative languages

Dilek Z. Hakkani-Tür, Kemal Oflazer, Gökhan Tur

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish presents an interesting problem for statistical models as the potential tag set size is very large because of the productive derivational morphology. We propose to handle this by breaking up the morphosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag by considering statistics over the individual inflectional groups and surface roots in trigram models. Among the four models that we have developed and tested, the simplest model, which ignores the local morphotactics within words, performs the best. Our best trigram model performs with 93.95% accuracy on our test data, getting all the morphosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy increases to 95.07%.
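The decomposition of a tag into inflectional groups can be illustrated with a minimal sketch. It assumes a parse notation in which features are joined by "+" and derivation boundaries are marked "^DB"; the abstract does not specify a notation, so the marker, the function names, and the example parse are illustrative assumptions. The sketch shows only the splitting step, not the trigram models themselves.

```python
from typing import List, Tuple

# Assumed derivation-boundary marker; the abstract does not specify one.
DERIVATION_BOUNDARY = "^DB"


def split_into_igs(parse: str, boundary: str = DERIVATION_BOUNDARY) -> Tuple[str, List[str]]:
    """Split a morphological parse into its root and inflectional groups (IGs)."""
    segments = parse.split(boundary)
    # The first segment holds the root followed by the first IG's features.
    root, _, first_features = segments[0].partition("+")
    # Later segments start with "+", which is dropped to leave the features.
    igs = [first_features] + [seg.lstrip("+") for seg in segments[1:]]
    return root, igs


def join_igs(igs: List[str], boundary: str = DERIVATION_BOUNDARY) -> str:
    """Rebuild the full morphosyntactic tag from its IGs."""
    return (boundary + "+").join(igs)


# Hypothetical parse used only for illustration (not taken from the paper's data).
example = "saglam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos^DB+Noun+Inf+A3sg+Pnon+Nom"
root, igs = split_into_igs(example)
print(root)
print(igs)
```

Splitting this way keeps the number of distinct units small: a model can collect statistics over individual groups such as "Verb+Caus+Pos" instead of over every full tag that derivation can produce.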

Original language: English
Pages (from-to): 381-410
Number of pages: 30
Journal: Language Resources and Evaluation
Volume: 36
Issue number: 4
Publication status: Published - 2002
Externally published: Yes

Keywords

  • Agglutinative languages
  • Morphological disambiguation
  • N-gram language models
  • Statistical natural language processing
  • Turkish

ASJC Scopus subject areas

  • Linguistics and Language

Cite this

Hakkani-Tür, D. Z., Oflazer, K., & Tur, G. (2002). Statistical morphological disambiguation for agglutinative languages. Language Resources and Evaluation, 36(4), 381-410.

@article{ce85536e5ab142cca8c2c7002822e82b,
title = "Statistical morphological disambiguation for agglutinative languages",
abstract = "We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish presents an interesting problem for statistical models as the potential tag set size is very large because of the productive derivational morphology. We propose to handle this by breaking up the morphosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag by considering statistics over the individual inflectional groups and surface roots in trigram models. Among the four models that we have developed and tested, the simplest model, which ignores the local morphotactics within words, performs the best. Our best trigram model performs with 93.95{\%} accuracy on our test data, getting all the morphosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy increases to 95.07{\%}.",
keywords = "Agglutinative languages, Morphological disambiguation, N-gram language models, Statistical natural language processing, Turkish",
author = "Hakkani-T{\"u}r, {Dilek Z.} and Kemal Oflazer and G{\"o}khan Tur",
year = "2002",
language = "English",
volume = "36",
pages = "381--410",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "4",

}

TY - JOUR

T1 - Statistical morphological disambiguation for agglutinative languages

AU - Hakkani-Tür, Dilek Z.

AU - Oflazer, Kemal

AU - Tur, Gökhan

PY - 2002

Y1 - 2002

AB - We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish presents an interesting problem for statistical models as the potential tag set size is very large because of the productive derivational morphology. We propose to handle this by breaking up the morphosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag by considering statistics over the individual inflectional groups and surface roots in trigram models. Among the four models that we have developed and tested, the simplest model, which ignores the local morphotactics within words, performs the best. Our best trigram model performs with 93.95% accuracy on our test data, getting all the morphosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy increases to 95.07%.

KW - Agglutinative languages

KW - Morphological disambiguation

KW - N-gram language models

KW - Statistical natural language processing

KW - Turkish

UR - http://www.scopus.com/inward/record.url?scp=52649167899&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=52649167899&partnerID=8YFLogxK

M3 - Article

VL - 36

SP - 381

EP - 410

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 4

ER -