Arabic OCR error correction using character segment correction, language modeling, and shallow morphology

Walid Magdy, Kareem Darwish

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

23 Citations (Scopus)

Abstract

This paper explores the use of a character-segment-based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experiments show that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology had a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.
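The following is a minimal sketch of the general idea summarized in the abstract, not the authors' implementation: candidate corrections are generated from a segment-level confusion table, filtered against a dictionary, and ranked by combining the channel score with a language model score. Everything in it is an assumption made for illustration: the table `SEGMENT_MODEL`, the constant `P_SAME`, the unigram table `UNIGRAM`, the function names, and the Latin-script toy data (the paper works on Arabic, and its actual model details are not given on this page).

```python
from math import log

# Hypothetical confusion table: P(ocr_segment | true_segment).
# A segment may span several characters, which a single-character
# model could not express (e.g. a true "m" recognised as "rn").
SEGMENT_MODEL = {
    ("rn", "m"): 0.2,
    ("l", "i"): 0.1,
    ("c", "e"): 0.05,
}
P_SAME = 0.9  # assumed probability that a character is recognised correctly

# Hypothetical unigram language model; it also serves as the dictionary.
UNIGRAM = {"modern": 0.6, "modem": 0.3, "model": 0.1}


def channel_candidates(ocr_word):
    """Map each reachable candidate to its best channel log-probability."""
    best = {}

    def expand(pos, prefix, logp):
        if pos == len(ocr_word):
            if logp > best.get(prefix, float("-inf")):
                best[prefix] = logp
            return
        # Option 1: assume the character at this position was read correctly.
        expand(pos + 1, prefix + ocr_word[pos], logp + log(P_SAME))
        # Option 2: undo a segment confusion that starts at this position.
        for (ocr_seg, true_seg), p in SEGMENT_MODEL.items():
            if ocr_word.startswith(ocr_seg, pos):
                expand(pos + len(ocr_seg), prefix + true_seg, logp + log(p))

    expand(0, "", 0.0)
    return best


def correct(ocr_word):
    """Return dictionary candidates, ranked best first by channel + LM score."""
    scored = [
        (logp + log(UNIGRAM[word]), word)
        for word, logp in channel_candidates(ocr_word).items()
        if word in UNIGRAM
    ]
    return [word for _, word in sorted(scored, reverse=True)]


print(correct("rnodern"))
```

In this toy setup, the segment table proposes both "modern" and "modem" for the input "rnodern", and the language model score contributes to the final ordering. The exhaustive expansion is exponential in the worst case, so a real system would need pruning or dynamic programming.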

Original language: English
Title of host publication: COLING/ACL 2006 - EMNLP 2006
Subtitle of host publication: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 408-414
Number of pages: 7
Publication status: Published - 1 Dec 2006
Event: 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006 - Sydney, NSW, Australia
Duration: 22 Jul 2006 to 23 Jul 2006

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Magdy, W., & Darwish, K. (2006). Arabic OCR error correction using character segment correction, language modeling, and shallow morphology. In COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 408-414).