Farasa: A new fast and accurate Arabic word segmenter

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

16 Citations (Scopus)

Abstract

In this paper, we present Farasa (meaning "insight" in Arabic), a fast and accurate Arabic segmenter. Segmentation involves breaking Arabic words into their constituent clitics. Our approach is based on SVMrank using linear kernels. The features we use account for: the likelihood of stems, prefixes, suffixes, and their combinations; presence in lexicons containing valid stems and named entities; and underlying stem templates. Farasa matches or outperforms state-of-the-art Arabic segmenters, namely QATARA and MADAMIRA, while being nearly one order of magnitude faster than QATARA and two orders of magnitude faster than MADAMIRA. The segmenter should be able to process one billion words in less than 5 hours. Farasa is written entirely in native Java, with no external dependencies, and is open-source.
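The ranking idea in the abstract can be illustrated with a toy sketch: enumerate candidate prefix+stem+suffix splits of a word, describe each with a few features, and pick the split with the highest linear score. This is not Farasa's code or API. The affix lists, the stem lexicon, the Buckwalter-style transliterated examples, the restriction to at most one prefix and one suffix, and the hand-set weights are all assumptions made for illustration. A ranking SVM would learn the weights from annotated data, and the paper's feature set is richer than the three indicators used here.

```python
from itertools import product

# Toy, hypothetical resources in Buckwalter-style transliteration (not Farasa's data).
PREFIXES = ("", "w", "b", "l", "Al")
SUFFIXES = ("", "h", "hm", "At")
STEM_LEXICON = {"ktAb", "wld"}

# Hand-set illustrative weights; a ranking SVM would learn these from data.
WEIGHTS = {"stem_in_lexicon": 2.0, "has_prefix": 0.3, "has_suffix": 0.3}


def candidates(word):
    """Yield (prefix, stem, suffix) splits with at most one prefix and one suffix."""
    for prefix, suffix in product(PREFIXES, SUFFIXES):
        # Skip splits that would leave an empty stem.
        if len(prefix) + len(suffix) >= len(word):
            continue
        if word.startswith(prefix) and word.endswith(suffix):
            stem = word[len(prefix): len(word) - len(suffix)]
            yield prefix, stem, suffix


def features(prefix, stem, suffix):
    """Map one candidate split to a small feature dictionary."""
    return {
        "stem_in_lexicon": 1.0 if stem in STEM_LEXICON else 0.0,
        "has_prefix": 1.0 if prefix else 0.0,
        "has_suffix": 1.0 if suffix else 0.0,
    }


def score(feats):
    """Linear score: dot product of the weights and the feature values."""
    return sum(WEIGHTS[name] * value for name, value in feats.items())


def segment(word):
    """Return the highest-scoring split, with segments joined by '+'."""
    best = max(candidates(word), key=lambda c: score(features(*c)))
    return "+".join(part for part in best if part)
```

With these toy resources, `segment("wktAbhm")` selects the split whose stem is in the lexicon, while `segment("wld")` is left whole because stripping the initial "w" would produce a stem that is not in the lexicon.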

Original language: English
Title of host publication: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
Publisher: European Language Resources Association (ELRA)
Pages: 1070-1074
Number of pages: 5
ISBN (Electronic): 9782951740891
Publication status: Published - 1 Jan 2016
Event: 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
Duration: 23 May 2016 to 28 May 2016

Other

Other: 10th International Conference on Language Resources and Evaluation, LREC 2016
Country: Slovenia
City: Portoroz
Period: 23/5/16 to 28/5/16

Keywords

  • Arabic morphology
  • Stemming
  • Word segmentation

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Language and Linguistics
  • Education

Cite this

Darwish, K., & Mubarak, H. (2016). Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 1070-1074). European Language Resources Association (ELRA).