Verifiably effective arabic dialect identification

Research output: Chapter in Book/Report/Conference proceedingConference contribution

12 Citations (Scopus)

Abstract

Several recent papers on Arabic dialect identification have hinted that using a word unigram model is sufficient and effective for the task. However, most previous work was done on a standard fairly homogeneous dataset of dialectal user comments. In this paper, we show that training on the standard dataset does not generalize, because a unigram model may be tuned to topics in the comments and does not capture the distinguishing features of dialects. We show that effective dialect identification requires that we account for the distinguishing lexical, morphological, and phonological phenomena of dialects. We show that accounting for such can improve dialect detection accuracy by nearly 10% absolute.

Original languageEnglish
Title of host publicationEMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
PublisherAssociation for Computational Linguistics (ACL)
Pages1465-1468
Number of pages4
ISBN (Electronic)9781937284961
Publication statusPublished - 2014
Event2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014 - Doha, Qatar
Duration: 25 Oct 201429 Oct 2014

Other

Other2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014
CountryQatar
CityDoha
Period25/10/1429/10/14

Fingerprint

Identification (control systems)

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Vision and Pattern Recognition
  • Information Systems

Cite this

Darwish, K., Sajjad, H., & Mubarak, H. (2014). Verifiably effective arabic dialect identification. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 1465-1468). Association for Computational Linguistics (ACL).

Verifiably effective arabic dialect identification. / Darwish, Kareem; Sajjad, Hassan; Mubarak, Hamdy.

EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2014. p. 1465-1468.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Darwish, K, Sajjad, H & Mubarak, H 2014, Verifiably effective arabic dialect identification. in EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. Association for Computational Linguistics (ACL), pp. 1465-1468, 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25/10/14.
Darwish K, Sajjad H, Mubarak H. Verifiably effective arabic dialect identification. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. Association for Computational Linguistics (ACL). 2014. p. 1465-1468
Darwish, Kareem ; Sajjad, Hassan ; Mubarak, Hamdy. / Verifiably effective arabic dialect identification. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2014. pp. 1465-1468
@inproceedings{f953c861d2834dbcba245aab318689b1,
title = "Verifiably effective arabic dialect identification",
abstract = "Several recent papers on Arabic dialect identification have hinted that using a word unigram model is sufficient and effective for the task. However, most previous work was done on a standard fairly homogeneous dataset of dialectal user comments. In this paper, we show that training on the standard dataset does not generalize, because a unigram model may be tuned to topics in the comments and does not capture the distinguishing features of dialects. We show that effective dialect identification requires that we account for the distinguishing lexical, morphological, and phonological phenomena of dialects. We show that accounting for such can improve dialect detection accuracy by nearly 10{\%} absolute.",
author = "Kareem Darwish and Hassan Sajjad and Hamdy Mubarak",
year = "2014",
language = "English",
pages = "1465--1468",
booktitle = "EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference",
publisher = "Association for Computational Linguistics (ACL)",

}

TY - GEN

T1 - Verifiably effective arabic dialect identification

AU - Darwish, Kareem

AU - Sajjad, Hassan

AU - Mubarak, Hamdy

PY - 2014

Y1 - 2014

N2 - Several recent papers on Arabic dialect identification have hinted that using a word unigram model is sufficient and effective for the task. However, most previous work was done on a standard fairly homogeneous dataset of dialectal user comments. In this paper, we show that training on the standard dataset does not generalize, because a unigram model may be tuned to topics in the comments and does not capture the distinguishing features of dialects. We show that effective dialect identification requires that we account for the distinguishing lexical, morphological, and phonological phenomena of dialects. We show that accounting for such can improve dialect detection accuracy by nearly 10% absolute.

AB - Several recent papers on Arabic dialect identification have hinted that using a word unigram model is sufficient and effective for the task. However, most previous work was done on a standard fairly homogeneous dataset of dialectal user comments. In this paper, we show that training on the standard dataset does not generalize, because a unigram model may be tuned to topics in the comments and does not capture the distinguishing features of dialects. We show that effective dialect identification requires that we account for the distinguishing lexical, morphological, and phonological phenomena of dialects. We show that accounting for such can improve dialect detection accuracy by nearly 10% absolute.

UR - http://www.scopus.com/inward/record.url?scp=84961291368&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961291368&partnerID=8YFLogxK

M3 - Conference contribution

SP - 1465

EP - 1468

BT - EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

PB - Association for Computational Linguistics (ACL)

ER -