Farasa: A new fast and accurate Arabic word segmenter

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

14 Citations (Scopus)

Abstract

In this paper, we present Farasa (meaning "insight" in Arabic), a fast and accurate Arabic segmenter. Segmentation involves breaking Arabic words into their constituent clitics. Our approach is based on SVMrank using linear kernels. The features we use account for: the likelihood of stems, prefixes, suffixes, and their combinations; presence in lexicons containing valid stems and named entities; and underlying stem templates. Farasa matches or outperforms state-of-the-art Arabic segmenters, namely QATARA and MADAMIRA, while running nearly one order of magnitude faster than QATARA and two orders of magnitude faster than MADAMIRA. The segmenter should be able to process one billion words in less than 5 hours. Farasa is written entirely in native Java, with no external dependencies, and is open-source.
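
The ranking idea described in the abstract can be illustrated with a minimal sketch: enumerate the candidate segmentations of a word, compute a few features for each, and pick the candidate with the highest linear score. This is not Farasa's implementation (which is in Java and learns its weights with SVMrank). The affix lists, the stem lexicon, the three features, the hand-set weights, and all function names below are hypothetical, chosen only to make the example small and runnable.

```python
# Toy inventories, assumed for illustration only (not Farasa's actual lists).
PREFIXES = ("و", "ف", "ب", "ل", "ال")
SUFFIXES = ("هم", "ها", "ه", "ات")
STEM_LEXICON = {"كتاب", "ولد"}

# Hand-set weights standing in for a learned linear ranking model.
WEIGHTS = {"stem_in_lexicon": 2.0, "num_affixes": -0.5, "stem_length": 0.1}


def prefix_splits(word):
    """Yield (prefixes, remainder) for every way to peel known prefixes off the front."""
    yield (), word
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            for rest, remainder in prefix_splits(word[len(p):]):
                yield (p,) + rest, remainder


def suffix_splits(word):
    """Yield (remainder, suffixes) for every way to peel known suffixes off the end."""
    yield word, ()
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            for remainder, rest in suffix_splits(word[:-len(s)]):
                yield remainder, rest + (s,)


def candidates(word):
    """Enumerate all (prefixes, stem, suffixes) analyses of a word."""
    for prefixes, rest in prefix_splits(word):
        for stem, suffixes in suffix_splits(rest):
            yield prefixes, stem, suffixes


def features(prefixes, stem, suffixes):
    """Map one candidate to a small feature dictionary."""
    return {
        "stem_in_lexicon": 1.0 if stem in STEM_LEXICON else 0.0,
        "num_affixes": float(len(prefixes) + len(suffixes)),
        "stem_length": float(len(stem)),
    }


def score(feats):
    """Linear score: dot product of the feature values and the weights."""
    return sum(WEIGHTS[name] * value for name, value in feats.items())


def segment(word):
    """Return the highest-scoring candidate, with segments joined by '+'."""
    prefixes, stem, suffixes = max(
        candidates(word), key=lambda c: score(features(*c))
    )
    return "+".join(prefixes + (stem,) + suffixes)


if __name__ == "__main__":
    for w in ("وكتابهم", "ولد"):
        print(w, "->", segment(w))
```

With these toy settings, the first word is split into a conjunction prefix, a stem, and a pronoun suffix, while the second is left whole because the unsegmented form is in the stem lexicon, even though it begins with a letter that could be a prefix. In a trained ranker the weights would instead be learned from annotated segmentations.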

Original language: English
Title of host publication: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
Publisher: European Language Resources Association (ELRA)
Pages: 1070-1074
Number of pages: 5
ISBN (Electronic): 9782951740891
Publication status: Published - 1 Jan 2016
Event: 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
Duration: 23 May 2016 to 28 May 2016

Other

Other: 10th International Conference on Language Resources and Evaluation, LREC 2016
Country: Slovenia
City: Portoroz
Period: 23/5/16 to 28/5/16

Fingerprint

segmentation
Kernel
Java
Template
Open Source
Constituent
Lexicon
Prefix
Entity
Clitics

Keywords

  • Arabic morphology
  • Stemming
  • Word segmentation

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Language and Linguistics
  • Education

Cite this

Darwish, K., & Mubarak, H. (2016). Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 1070-1074). European Language Resources Association (ELRA).

Farasa: A new fast and accurate Arabic word segmenter. / Darwish, Kareem; Mubarak, Hamdy. Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA), 2016. p. 1070-1074.

Darwish, K & Mubarak, H 2016, Farasa: A new fast and accurate Arabic word segmenter. in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA), pp. 1070-1074, 10th International Conference on Language Resources and Evaluation, LREC 2016, Portoroz, Slovenia, 23/5/16.

Darwish K, Mubarak H. Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA). 2016. p. 1070-1074.

Darwish, Kareem; Mubarak, Hamdy. / Farasa: A new fast and accurate Arabic word segmenter. Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA), 2016. pp. 1070-1074.
@inproceedings{08f00dec526143ea94b2b6358dee46c0,
title = "Farasa: A new fast and accurate Arabic word segmenter",
abstract = "In this paper, we present Farasa (meaning insight in Arabic), which is a fast and accurate Arabic segmenter. Segmentation involves breaking Arabic words into their constituent clitics. Our approach is based on SVMrank using linear kernels. The features that we utilized account for: likelihood of stems, prefixes, suffixes, and their combination; presence in lexicons containing valid stems and named entities; and underlying stem templates. Farasa outperforms or equalizes state-of-the-art Arabic segmenters, namely QATARA and MADAMIRA. Meanwhile, Farasa is nearly one order of magnitude faster than QATARA and two orders of magnitude faster than MADAMIRA. The segmenter should be able to process one billion words in less than 5 hours. Farasa is written entirely in native Java, with no external dependencies, and is open-source.",
keywords = "Arabic morphology, Stemming, Word segmentation",
author = "Kareem Darwish and Hamdy Mubarak",
year = "2016",
month = "1",
day = "1",
language = "English",
pages = "1070--1074",
booktitle = "Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - Farasa

T2 - A new fast and accurate Arabic word segmenter

AU - Darwish, Kareem

AU - Mubarak, Hamdy

PY - 2016/1/1

Y1 - 2016/1/1

N2 - In this paper, we present Farasa (meaning insight in Arabic), which is a fast and accurate Arabic segmenter. Segmentation involves breaking Arabic words into their constituent clitics. Our approach is based on SVMrank using linear kernels. The features that we utilized account for: likelihood of stems, prefixes, suffixes, and their combination; presence in lexicons containing valid stems and named entities; and underlying stem templates. Farasa outperforms or equalizes state-of-the-art Arabic segmenters, namely QATARA and MADAMIRA. Meanwhile, Farasa is nearly one order of magnitude faster than QATARA and two orders of magnitude faster than MADAMIRA. The segmenter should be able to process one billion words in less than 5 hours. Farasa is written entirely in native Java, with no external dependencies, and is open-source.

KW - Arabic morphology

KW - Stemming

KW - Word segmentation

UR - http://www.scopus.com/inward/record.url?scp=85029368935&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85029368935&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85029368935

SP - 1070

EP - 1074

BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

PB - European Language Resources Association (ELRA)

ER -