A statistical information extraction system for Turkish

Gökhan Tür, Dilek Hakkani-Tür, Kemal Oflazer

Research output: Contribution to journal › Article

49 Citations (Scopus)

Abstract

This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, there is a very small number of possible word forms with a given root word. However, languages like Turkish have very productive agglutinative morphology. Thus, it is an issue to build statistical models for specific tasks using the surface forms of the words, mainly because of the data sparseness problem. In order to alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we have modeled the final inflectional groups of the words and combined it with the lexical model, and decreased the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) have been found to be more effective than using the surface forms of the words and we have achieved 10.90% segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, and reached an F-Measure of 91.56%, according to the MUC evaluation criteria. Our results are important in the sense that, using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
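
The data sparseness argument in the abstract can be illustrated with a small sketch. It is not the authors' system: the surface-to-stem table below is a hypothetical, hand-written stand-in for a real morphological analyzer, and the function name `count_types` is invented for illustration. It only shows why counting stems instead of surface forms gives each model unit more observations.

```python
from collections import Counter

# Hypothetical toy analyses (surface form -> stem). These hand-written pairs
# stand in for the output of a real Turkish morphological analyzer.
TOY_STEMS = {
    "ev": "ev",          # house
    "evler": "ev",       # houses
    "evde": "ev",        # in the house
    "evlerde": "ev",     # in the houses
    "evim": "ev",        # my house
    "kitap": "kitap",    # book
    "kitaplar": "kitap", # books
    "kitapta": "kitap",  # in the book
}


def count_types(tokens, stem_of):
    """Count occurrences per surface form and per stem for a token list."""
    surface_counts = Counter(tokens)
    stem_counts = Counter(stem_of[token] for token in tokens)
    return surface_counts, stem_counts


tokens = ["evler", "evde", "kitaplar", "evlerde", "kitapta", "evim", "ev", "kitap"]
surface_counts, stem_counts = count_types(tokens, TOY_STEMS)

# Every surface form is seen once, while the two stems pool all observations.
print(len(surface_counts), "distinct surface forms")
print(len(stem_counts), "distinct stems")
print(stem_counts.most_common())
```

In this toy sample each surface form occurs only once, so a model keyed on surface forms has almost no evidence per unit, whereas the same tokens grouped by stem concentrate the counts on two units.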

Original language: English
Pages (from-to): 181-210
Number of pages: 30
Journal: Natural Language Engineering
Volume: 9
Issue number: 2
DOI: 10.1017/S135132490200284X
Publication status: Published - Jun 2003
Externally published: Yes

Fingerprint

Syntactics
Linguistics
Statistical methods
Language
Statistical Information
Information Extraction
Processing
English language
Segmentation
Costs
Statistical Models
Evaluation
Surface Form
Group
Names

ASJC Scopus subject areas

  • Software

Cite this

A statistical information extraction system for Turkish. / Tür, Gökhan; Hakkani-Tür, Dilek; Oflazer, Kemal.

In: Natural Language Engineering, Vol. 9, No. 2, 06.2003, p. 181-210.

@article{a4904c0c7dba4f4d8d4f1e524cf26f16,
title = "A statistical information extraction system for Turkish",
abstract = "This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, there is a very small number of possible word forms with a given root word. However, languages like Turkish have very productive agglutinative morphology. Thus, it is an issue to build statistical models for specific tasks using the surface forms of the words, mainly because of the data sparseness problem. In order to alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we have modeled the final inflectional groups of the words and combined it with the lexical model, and decreased the error rate to 4.34{\%}, which is 21{\%} better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) have been found to be more effective than using the surface forms of the words and we have achieved 10.90{\%} segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32{\%} better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, and reached an F-Measure of 91.56{\%}, according to the MUC evaluation criteria. Our results are important in the sense that, using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.",
author = "G{\"o}khan T{\"u}r and Dilek Hakkani-T{\"u}r and Kemal Oflazer",
year = "2003",
month = "6",
doi = "10.1017/S135132490200284X",
language = "English",
volume = "9",
pages = "181--210",
journal = "Natural Language Engineering",
issn = "1351-3249",
publisher = "Cambridge University Press",
number = "2",
}

TY  - JOUR
T1  - A statistical information extraction system for Turkish
AU  - Tür, Gökhan
AU  - Hakkani-Tür, Dilek
AU  - Oflazer, Kemal
PY  - 2003/6
Y1  - 2003/6
AB  - This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, there is a very small number of possible word forms with a given root word. However, languages like Turkish have very productive agglutinative morphology. Thus, it is an issue to build statistical models for specific tasks using the surface forms of the words, mainly because of the data sparseness problem. In order to alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we have modeled the final inflectional groups of the words and combined it with the lexical model, and decreased the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) have been found to be more effective than using the surface forms of the words and we have achieved 10.90% segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, and reached an F-Measure of 91.56%, according to the MUC evaluation criteria. Our results are important in the sense that, using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
UR  - http://www.scopus.com/inward/record.url?scp=0037599624&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0037599624&partnerID=8YFLogxK
U2  - 10.1017/S135132490200284X
DO  - 10.1017/S135132490200284X
M3  - Article
VL  - 9
SP  - 181
EP  - 210
JO  - Natural Language Engineering
JF  - Natural Language Engineering
SN  - 1351-3249
IS  - 2
ER  -