Abstract
This paper presents an entirely new, one-million-word annotated corpus for comprehensive, machine-learning-based preprocessing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve the NLP tasks of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. This single-component configuration runs faster and, according to our evaluations, provides state-of-the-art precision and recall. The fine-grained tag set output by our annotator greatly simplifies downstream tasks such as lemmatization. Provided as a trained OpenNLP component, the annotator is free for research purposes.
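As an illustration of what treating segmentation, POS tagging, and named entity recognition as one sequence labeling task can look like, the following is a minimal Python sketch. It is not the authors' tag set or their OpenNLP component: the tag format (`POS+POS|ENTITY`), the tag names, the `JointLabel`, `encode`, and `decode` helpers, and the transliterated toy sentence are all assumptions made for this example. The idea shown is only that a single composite tag per token can carry all three layers, so one labeler's output can be unpacked into segments, POS tags, and entity tags.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class JointLabel:
    """One label per whitespace token, covering all three tasks (hypothetical format)."""
    segment_pos: Tuple[str, ...]  # POS of each segment of the token, in order
    entity: str                   # BIO-style entity tag, "O" when outside any entity


def encode(label: JointLabel) -> str:
    """Pack segmentation, POS, and NER information into a single tag string."""
    return "+".join(label.segment_pos) + "|" + label.entity


def decode(tag: str) -> JointLabel:
    """Recover the three layers from a single predicted tag string."""
    pos_part, entity = tag.rsplit("|", 1)
    return JointLabel(tuple(pos_part.split("+")), entity)


# Toy example in Buckwalter-style transliteration: "wfy AlqAhrp" ("and in Cairo").
tokens: List[str] = ["wfy", "AlqAhrp"]
gold = [
    JointLabel(("CONJ", "PREP"), "O"),     # w + fy
    JointLabel(("DET", "NOUN"), "B-LOC"),  # Al + qAhrp
]

# A sequence labeler would be trained on (tokens, tags) pairs.
tags = [encode(label) for label in gold]

# Decoding the labeler's output yields all three layers at once.
decoded = [decode(tag) for tag in tags]
segment_counts = [len(label.segment_pos) for label in decoded]
```

In this sketch the number of segments in a token is implied by the number of POS parts in its tag, which is one simple way a fine-grained tag set could make downstream steps such as lemmatization easier; the scheme actually used in the paper may differ.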
Original language | English |
---|---|
Title of host publication | 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1-8 |
Number of pages | 8 |
ISBN (Electronic) | 9781538645437 |
DOIs | 10.1109/ICNLSP.2018.8374393 |
Publication status | Published - 6 Jun 2018 |
Event | 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018 - Algiers, Algeria; Duration: 25 Apr 2018 → 26 Apr 2018 |
Keywords
- Lemmatization
- Machine learning
- Named entity recognition
- NLP
- POS tagging
- Segmentation
ASJC Scopus subject areas
- Linguistics and Language
- Communication
- Artificial Intelligence
- Signal Processing
Cite this
A single-model approach for Arabic segmentation, POS tagging, and named entity recognition. / Freihat, Abed Alhakim; Bella, Gabor; Mubarak, Hamdy; Giunchiglia, Fausto.
2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018. Institute of Electrical and Electronics Engineers Inc., 2018. p. 1-8.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - A single-model approach for Arabic segmentation, POS tagging, and named entity recognition
AU - Freihat, Abed Alhakim
AU - Bella, Gabor
AU - Mubarak, Hamdy
AU - Giunchiglia, Fausto
PY - 2018/6/6
Y1 - 2018/6/6
KW - Lemmatization
KW - Machine learning
KW - Named entity recognition
KW - NLP
KW - POS tagging
KW - Segmentation
UR - http://www.scopus.com/inward/record.url?scp=85049371695&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85049371695&partnerID=8YFLogxK
U2 - 10.1109/ICNLSP.2018.8374393
DO - 10.1109/ICNLSP.2018.8374393
M3 - Conference contribution
AN - SCOPUS:85049371695
SP - 1
EP - 8
BT - 2nd International Conference on Natural Language and Speech Processing, ICNLSP 2018
PB - Institute of Electrical and Electronics Engineers Inc.
ER -