Arabic OCR error correction using character segment correction, language modeling, and shallow morphology

Walid Magdy, Kareem Darwish

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

22 Citations (Scopus)

Abstract

This paper explores the use of a character-segment-based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experimentation shows that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology had a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.
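
As a rough illustration of the idea summarized in the abstract, and not the authors' implementation, the sketch below generates correction candidates by substituting character segments and ranks them with a language model. Everything in it is an assumption made for illustration: the names CONFUSIONS, UNIGRAM, and P_UNCHANGED, the probabilities, the use of a unigram model, and the Latin-script toy words standing in for Arabic.

```python
import math

# Hypothetical segment confusion table: (observed, intended) -> P(observed | intended).
# A segment may span more than one character, e.g. "rn" misread for "m".
CONFUSIONS = {("rn", "m"): 0.2, ("l", "i"): 0.05, ("c", "e"): 0.05}
P_UNCHANGED = 0.9  # assumed probability that a word was recognized correctly

# Hypothetical unigram language model that also serves as the dictionary.
UNIGRAM = {"modern": 0.4, "model": 0.5, "modem": 0.1}


def candidates(observed):
    """Yield (candidate, channel_prob) pairs reachable by one segment substitution."""
    if observed in UNIGRAM:
        yield observed, P_UNCHANGED
    for (seen, intended), prob in CONFUSIONS.items():
        start = observed.find(seen)
        while start != -1:
            fixed = observed[:start] + intended + observed[start + len(seen):]
            if fixed in UNIGRAM:  # keep only dictionary words
                yield fixed, prob
            start = observed.find(seen, start + 1)


def correct(observed):
    """Return candidates ranked by log P(observed | candidate) + log P(candidate)."""
    scored = {}
    for cand, channel in candidates(observed):
        score = math.log(channel) + math.log(UNIGRAM[cand])
        scored[cand] = max(score, scored.get(cand, -math.inf))
    return sorted(scored, key=scored.get, reverse=True)
```

With this toy data, correct("rnodel") proposes "model" through the two-character segment "rn", a fix that a table restricted to one-to-one character substitutions could not reach, and the language model term decides the order when several dictionary words are reachable.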

Original language: English
Title of host publication: COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 408-414
Number of pages: 7
Publication status: Published - 1 Dec 2006
Externally published: Yes
Event: 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006 - Sydney, NSW, Australia
Duration: 22 Jul 2006 to 23 Jul 2006

Other

Other: 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006
Country: Australia
City: Sydney, NSW
Period: 22/7/06 to 23/7/06

Fingerprint

Optical character recognition
Error correction
Glossaries

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Magdy, W., & Darwish, K. (2006). Arabic OCR error correction using character segment correction, language modeling, and shallow morphology. In COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 408-414)

@inproceedings{dc8b346bf5e74cfda756ea97acfa9563,
title = "Arabic OCR error correction using character segment correction, language modeling, and shallow morphology",
abstract = "This paper explores the use of a character-segment-based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experimentation shows that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology had a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.",
author = "Walid Magdy and Kareem Darwish",
year = "2006",
month = "12",
day = "1",
language = "English",
isbn = "1932432736",
pages = "408--414",
booktitle = "COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference",

}

TY - GEN

T1 - Arabic OCR error correction using character segment correction, language modeling, and shallow morphology

AU - Magdy, Walid

AU - Darwish, Kareem

PY - 2006/12/1

Y1 - 2006/12/1

N2 - This paper explores the use of a character-segment-based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experimentation shows that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology had a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.

AB - This paper explores the use of a character-segment-based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experimentation shows that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology had a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.

UR - http://www.scopus.com/inward/record.url?scp=37049008029&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=37049008029&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:37049008029

SN - 1932432736

SN - 9781932432732

SP - 408

EP - 414

BT - COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

ER -