Effective multi-dialectal Arabic POS tagging

Kareem Darwish, Mohammed Attia, Hamdy Mubarak, Younes Samih, Ahmed Abdelali, Lluís Màrquez, Mohamed Eldesouki, Laura Kallmeyer

Research output: Contribution to journal › Article

Abstract

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks and a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtained these results using a train/dev/test split of 70/10/20 on a data set of 350 tweets per dialect.
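To make the data split concrete, the following is a minimal Python sketch of a 70/10/20 train/dev/test split applied to 350 items per dialect. It is an illustration only, not the authors' procedure: the abstract does not say whether the split was made per dialect, how tweets were shuffled, or which seed was used, so the per-dialect split, the `split_dataset` helper, the placeholder tweet IDs, and the seed are all assumptions.

```python
import random

# Dialect groups and data set size as stated in the abstract.
DIALECTS = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]
TWEETS_PER_DIALECT = 350


def split_dataset(items, percents=(70, 10, 20), seed=0):
    """Shuffle items and split them into train/dev/test by integer percentages.

    Hypothetical helper: the shuffling and the seed are assumptions.
    """
    assert sum(percents) == 100
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_dev = len(items) * percents[1] // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


# Placeholder IDs stand in for the annotated tweets (assumption).
splits = {}
for dialect in DIALECTS:
    tweet_ids = [f"{dialect}-{i}" for i in range(TWEETS_PER_DIALECT)]
    splits[dialect] = split_dataset(tweet_ids)

for dialect, (train, dev, test) in splits.items():
    print(dialect, len(train), len(dev), len(test))

# A joint multi-dialectal training set, assuming it is the union of the
# per-dialect training portions.
joint_train = [t for train, _, _ in splits.values() for t in train]
print("joint train size:", len(joint_train))
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, and a joint training set built from all four dialects would contain 980 tweets.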

Original language: English
Journal: Natural Language Engineering
DOIs: https://doi.org/10.1017/S1351324920000078
Publication status: Accepted/In press - 1 Jan 2020

Keywords

  • Arabic
  • Brown clusters
  • Deep neural network
  • Dialects
  • Part-of-speech tagging

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence

Cite this

Darwish, K., Attia, M., Mubarak, H., Samih, Y., Abdelali, A., Màrquez, L., Eldesouki, M., & Kallmeyer, L. (Accepted/In press). Effective multi-dialectal Arabic POS tagging. Natural Language Engineering. https://doi.org/10.1017/S1351324920000078