Part-of-speech tagging for Arabic Gulf dialect using Bi-LSTM

Randah Alharbi, Walid Magdy, Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Part-of-speech (POS) tagging is one of the most important areas addressed in natural language processing (NLP). Effective POS taggers exist for many languages, including Arabic. However, POS research for Arabic has focused mainly on Modern Standard Arabic (MSA), while less attention has been directed towards Dialectal Arabic (DA). MSA is the formal variant, found mainly in news and formal textbooks, while DA is the informal spoken Arabic that varies across regions of the Arab world. DA is heavily used online owing to the wide spread of social media, which has increased research interest in building NLP tools for DA. Most research on DA focuses on Egyptian and Levantine, while much less attention is given to the Gulf dialect. In this paper, we present a POS tagger for the Arabic Gulf dialect that is more effective than currently available Arabic POS taggers. Our work includes preparing a POS tagging dataset, engineering multiple sets of features, and applying two machine learning methods: a Support Vector Machine (SVM) classifier and a bi-directional Long Short Term Memory (Bi-LSTM) network for sequence modeling. We have improved POS tagging accuracy for the Gulf dialect from 75% using a state-of-the-art MSA POS tagger to over 91% using a Bi-LSTM labeler.

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 3925-3932
Number of pages: 8
ISBN (Electronic): 9791095546009
Publication status: Published - 1 Jan 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018

Other

Other: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 7/5/18 to 12/5/18

Keywords

  • Bidirectional Long Short Term Memory (Bi-LSTM)
  • Dialectal Arabic (DA)
  • Gulf Arabic (GA)
  • Part-of-Speech (POS)

ASJC Scopus subject areas

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Alharbi, R., Magdy, W., Darwish, K., Abdelali, A., & Mubarak, H. (2019). Part-of-speech tagging for Arabic Gulf dialect using Bi-LSTM. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 3925-3932). European Language Resources Association (ELRA).

@inproceedings{e1ee78e33f214da181fc607a85e4219d,
title = "Part-of-speech tagging for Arabic Gulf dialect using Bi-LSTM",
abstract = "Part-of-speech (POS) tagging is one of the most important addressed areas in the natural language processing (NLP). There are effective POS taggers for many languages including Arabic. However, POS research for Arabic focused mainly on Modern Standard Arabic (MSA), while less attention was directed towards Dialect Arabic (DA). MSA is the formal variant which is mainly found in news and formal text books, while DA is the informal spoken Arabic that varies among different regions in the Arab world. DA is heavily used online due to the large spread of social media, which increased research directions towards building NLP tools for DA. Most research on DA focuses on Egyptian and Levantine, while much less attention is given to the Gulf dialect. In this paper, we present a more effective POS tagger for the Arabic Gulf dialect than currently available Arabic POS taggers. Our work includes preparing a POS tagging dataset, engineering multiple sets of features, and applying two machine learning methods, namely Support Vector Machine (SVM) classifier and bi-directional Long Short Term Memory (Bi-LSTM) for sequence modeling. We have improved POS tagging for Gulf dialect from 75{\%} accuracy using a state-of-the-art MSA POS tagger to over 91{\%} accuracy using a Bi-LSTM labeler.",
keywords = "Bidirectional Long Short Term Memory (Bi-LSTM), Dialectal Arabic (DA), Gulf Arabic (GA), Part-of-Speech (POS)",
author = "Randah Alharbi and Walid Magdy and Kareem Darwish and Ahmed Abdelali and Hamdy Mubarak",
year = "2019",
month = "1",
day = "1",
language = "English",
pages = "3925--3932",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - Part-of-speech tagging for Arabic Gulf dialect using Bi-LSTM

AU - Alharbi, Randah

AU - Magdy, Walid

AU - Darwish, Kareem

AU - Abdelali, Ahmed

AU - Mubarak, Hamdy

PY - 2019/1/1

Y1 - 2019/1/1

AB - Part-of-speech (POS) tagging is one of the most important addressed areas in the natural language processing (NLP). There are effective POS taggers for many languages including Arabic. However, POS research for Arabic focused mainly on Modern Standard Arabic (MSA), while less attention was directed towards Dialect Arabic (DA). MSA is the formal variant which is mainly found in news and formal text books, while DA is the informal spoken Arabic that varies among different regions in the Arab world. DA is heavily used online due to the large spread of social media, which increased research directions towards building NLP tools for DA. Most research on DA focuses on Egyptian and Levantine, while much less attention is given to the Gulf dialect. In this paper, we present a more effective POS tagger for the Arabic Gulf dialect than currently available Arabic POS taggers. Our work includes preparing a POS tagging dataset, engineering multiple sets of features, and applying two machine learning methods, namely Support Vector Machine (SVM) classifier and bi-directional Long Short Term Memory (Bi-LSTM) for sequence modeling. We have improved POS tagging for Gulf dialect from 75% accuracy using a state-of-the-art MSA POS tagger to over 91% accuracy using a Bi-LSTM labeler.

KW - Bidirectional Long Short Term Memory (Bi-LSTM)

KW - Dialectal Arabic (DA)

KW - Gulf Arabic (GA)

KW - Part-of-Speech (POS)

UR - http://www.scopus.com/inward/record.url?scp=85059908306&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059908306&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85059908306

SP - 3925

EP - 3932

BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Piperidis, Stelios

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Hasida, Koiti

A2 - Mazo, Helene

A2 - Choukri, Khalid

A2 - Goggi, Sara

A2 - Mariani, Joseph

A2 - Moreno, Asuncion

A2 - Calzolari, Nicoletta

A2 - Odijk, Jan

A2 - Tokunaga, Takenobu

PB - European Language Resources Association (ELRA)

ER -