A single-model approach for Arabic segmentation, POS tagging, and named entity recognition

Abed Alhakim Freihat, Gabor Bella, Hamdy Mubarak, Fausto Giunchiglia

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

This paper presents an entirely new, one-million-word annotated corpus for comprehensive, machine-learning-based preprocessing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve the NLP tasks of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. This single-component configuration results in faster operation and provides state-of-the-art precision and recall according to our evaluations. The fine-grained tag set output by our annotator greatly simplifies downstream tasks such as lemmatization. Provided as a trained OpenNLP component, the annotator is free for research purposes.
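To illustrate what treating the three tasks as one sequence labeling problem can look like, the Python sketch below decodes a single composite label per token into segmentation, POS, and entity views. The label format (`POS:length+POS:length|NER`), the tag names, and the Latin-transliterated example token are all assumptions made for illustration. They are not the tag set or the OpenNLP component described in the paper.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    segments: List[str]  # clitic and stem pieces of the token
    pos: List[str]       # one POS tag per segment
    entity: str          # BIO-style entity tag for the whole token


def decode(token: str, label: str) -> Analysis:
    """Split one composite label into segmentation, POS, and NER views.

    Hypothetical label format (not the paper's): "<POS>:<len>+<POS>:<len>|<NER>",
    where each <len> is the number of characters in that segment.
    """
    morph, entity = label.split("|")
    segments: List[str] = []
    pos: List[str] = []
    start = 0
    for part in morph.split("+"):
        tag, length = part.split(":")
        end = start + int(length)
        segments.append(token[start:end])
        pos.append(tag)
        start = end
    if start != len(token):
        raise ValueError(f"label {label!r} does not cover token {token!r}")
    return Analysis(segments, pos, entity)


# "wbyrwt" is an illustrative transliteration of "and Beirut":
# a one-letter conjunction proclitic followed by a five-letter noun.
result = decode("wbyrwt", "CONJ:1+NOUN:5|B-LOC")
print(result.segments, result.pos, result.entity)
```

Because a single predicted label carries all three pieces of information, one tagger pass over the tokens is enough in this sketch, whereas a pipeline would run a segmenter, a POS tagger, and an entity recognizer in sequence.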

Original language: English
Title of host publication: 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-8
Number of pages: 8
ISBN (Electronic): 9781538645437
DOI: https://doi.org/10.1109/ICNLSP.2018.8374393
Publication status: Published - 6 Jun 2018
Event: 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018 - Algiers, Algeria
Duration: 25 Apr 2018 to 26 Apr 2018

Other

Other: 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018
Country: Algeria
City: Algiers
Period: 25/4/18 to 26/4/18

Fingerprint

Labeling
Learning systems
Pipelines
evaluation
learning
segmentation

Keywords

  • Lemmatization
  • Machine learning
  • Named entity recognition
  • NLP
  • POS tagging
  • Segmentation

ASJC Scopus subject areas

  • Linguistics and Language
  • Communication
  • Artificial Intelligence
  • Signal Processing

Cite this

Freihat, A. A., Bella, G., Mubarak, H., & Giunchiglia, F. (2018). A single-model approach for Arabic segmentation, POS tagging, and named entity recognition. In 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018 (pp. 1-8). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICNLSP.2018.8374393

@inproceedings{c9420bf11e654d6ea29d0a274c3e738e,
title = "A single-model approach for Arabic segmentation, POS tagging, and named entity recognition",
abstract = "This paper presents an entirely new, one-million-word annotated corpus for a comprehensive, machine-learning-based preprocessing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve the NLP tasks of word segmentation, POS tagging and named entity recognition as a single sequence labeling task. This single-component configuration results in a faster operation and is able to provide state-of-the-art precision and recall according to our evaluations. The fine-grained output tag set output by our annotator greatly simplifies downstream tasks such as lemmatization. Provided as a trained OpenNLP component, the annotator is free for research purposes.",
keywords = "Lemmatization, Machine learning, Named entity recognition, NLP, POS tagging, Segmentation",
author = "Freihat, {Abed Alhakim} and Gabor Bella and Hamdy Mubarak and Fausto Giunchiglia",
year = "2018",
month = "6",
day = "6",
doi = "10.1109/ICNLSP.2018.8374393",
language = "English",
pages = "1--8",
booktitle = "2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc."
}

TY - GEN
T1 - A single-model approach for Arabic segmentation, POS tagging, and named entity recognition
AU - Freihat, Abed Alhakim
AU - Bella, Gabor
AU - Mubarak, Hamdy
AU - Giunchiglia, Fausto
PY - 2018/6/6
Y1 - 2018/6/6
N2 - This paper presents an entirely new, one-million-word annotated corpus for a comprehensive, machine-learning-based preprocessing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve the NLP tasks of word segmentation, POS tagging and named entity recognition as a single sequence labeling task. This single-component configuration results in a faster operation and is able to provide state-of-the-art precision and recall according to our evaluations. The fine-grained output tag set output by our annotator greatly simplifies downstream tasks such as lemmatization. Provided as a trained OpenNLP component, the annotator is free for research purposes.
AB - This paper presents an entirely new, one-million-word annotated corpus for a comprehensive, machine-learning-based preprocessing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve the NLP tasks of word segmentation, POS tagging and named entity recognition as a single sequence labeling task. This single-component configuration results in a faster operation and is able to provide state-of-the-art precision and recall according to our evaluations. The fine-grained output tag set output by our annotator greatly simplifies downstream tasks such as lemmatization. Provided as a trained OpenNLP component, the annotator is free for research purposes.
KW - Lemmatization
KW - Machine learning
KW - Named entity recognition
KW - NLP
KW - POS tagging
KW - Segmentation
UR - http://www.scopus.com/inward/record.url?scp=85049371695&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85049371695&partnerID=8YFLogxK
U2 - 10.1109/ICNLSP.2018.8374393
DO - 10.1109/ICNLSP.2018.8374393
M3 - Conference contribution
SP - 1
EP - 8
BT - 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018
PB - Institute of Electrical and Electronics Engineers Inc.
ER -