Omni font OCR error correction with effect on retrieval

Walid Magdy, Kareem Darwish

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Recent library digitization projects attempt to provide large collections of printed material from varying sources in a searchable format. The scanned documents are usually processed using Optical Character Recognition (OCR), which typically introduces errors into the text. This paper proposes a technique for correcting OCR-degraded text that is independent of character-level OCR errors, and hence independent of the scanned document source. It is based on language modeling in conjunction with a uniform character model that uses edit distance only. The technique compares well to state-of-the-art correction techniques that are based on language modeling and source-specific character error models. Although the proposed technique yielded lower correction effectiveness, its impact on retrieval effectiveness is statistically significant and on par with state-of-the-art correction techniques. The main requirement of the proposed technique is the training of a "good" language model matching genre, style, and temporal coverage. The advantage of being independent of character-level errors is clear in applications where printed documents vary in source, font, and degradation level.
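The idea of pairing a language model with a uniform, edit-distance-only character model can be illustrated with a minimal sketch. This is not the authors' implementation: the add-one smoothed bigram model, the greedy left-to-right decoding, the English toy corpus, and the names `train`, `correct_word`, `correct_sentence`, `max_dist`, and `edit_penalty` are all assumptions made for illustration. The only property carried over from the abstract is that every character edit costs the same, so no source-specific confusion statistics are needed.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for every insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def train(sentences):
    """Count unigrams and bigrams; "<s>" marks the start of each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def correct_word(word, prev_word, unigrams, bigrams, max_dist=2, edit_penalty=4.0):
    """Pick the vocabulary word with the best LM score minus a uniform edit cost.

    edit_penalty is a hypothetical weight: each edit costs the same amount,
    whichever characters are involved (the "uniform character model").
    """
    vocab_size = len(unigrams)
    best, best_score = word, -math.inf
    for cand in unigrams:
        if cand == "<s>":
            continue
        dist = edit_distance(word, cand)
        if dist > max_dist:
            continue
        # Add-one smoothed bigram log-probability of cand given prev_word.
        lm = math.log((bigrams[(prev_word, cand)] + 1)
                      / (unigrams[prev_word] + vocab_size))
        score = lm - edit_penalty * dist
        if score > best_score:
            best, best_score = cand, score
    return best  # unchanged input word if no candidate is close enough


def correct_sentence(text, unigrams, bigrams):
    """Greedy left-to-right correction, conditioning on the previous output word."""
    out, prev = [], "<s>"
    for word in text.split():
        prev = correct_word(word, prev, unigrams, bigrams)
        out.append(prev)
    return " ".join(out)


if __name__ == "__main__":
    corpus = [
        "the library scanned the old book",
        "the old book was scanned",
        "the library was old",
    ]
    uni, bi = train(corpus)
    print(correct_sentence("the 1ibrary scamned the o1d bock", uni, bi))
```

Because the character model is uniform, the same code applies unchanged to text from any scanner or font; the quality of the output then depends mainly on how well the language model matches the documents, which is the requirement the abstract highlights.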

Original language: English
Title of host publication: Proceedings of the 2010 10th International Conference on Intelligent Systems Design and Applications, ISDA'10
Pages: 415-420
Number of pages: 6
DOI: https://doi.org/10.1109/ISDA.2010.5687228
Publication status: Published - 1 Dec 2010
Externally published: Yes
Event: 2010 10th International Conference on Intelligent Systems Design and Applications, ISDA'10 - Cairo, Egypt
Duration: 29 Nov 2010 to 1 Dec 2010


Fingerprint

Optical character recognition
Error correction
Analog to digital conversion
Degradation

Keywords

  • Arabic text
  • Error correction
  • Information retrieval
  • Language modeling
  • OCR

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Hardware and Architecture

Cite this

Magdy, W., & Darwish, K. (2010). Omni font OCR error correction with effect on retrieval. In Proceedings of the 2010 10th International Conference on Intelligent Systems Design and Applications, ISDA'10 (pp. 415-420). [5687228] https://doi.org/10.1109/ISDA.2010.5687228

@inproceedings{49f7362ff9004e3a97aef636987d34ee,
title = "Omni font OCR error correction with effect on retrieval",
keywords = "Arabic text, Error correction, Information retrieval, Language modeling, OCR",
author = "Walid Magdy and Kareem Darwish",
year = "2010",
month = "12",
day = "1",
doi = "10.1109/ISDA.2010.5687228",
language = "English",
isbn = "9781424481354",
pages = "415--420",
booktitle = "Proceedings of the 2010 10th International Conference on Intelligent Systems Design and Applications, ISDA'10",

}
