Effect of OCR error correction on Arabic retrieval

Walid Magdy, Kareem Darwish

Research output: Contribution to journal › Article

9 Citations (Scopus)

Abstract

Arabic documents that are available only in print continue to be ubiquitous, and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling, with different correction abilities, were tested on real OCR and synthetic OCR degradation. Results show that the reduction in word error rate needs to pass a certain limit to have a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-grams for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR-degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
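The abstract refers to two quantities that are easy to illustrate: character n-gram index terms and word error rate. The following is a minimal Python sketch, not the authors' implementation. The function names (`char_ngrams`, `index_terms`, `word_error_rate`) and the choice to keep words shorter than n whole are assumptions made for illustration, and the example uses Latin-script strings only for readability.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word.

    Assumption for this sketch: words no longer than n are kept whole.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text, n=3):
    """Split text on whitespace and emit character n-grams for every word."""
    terms = []
    for word in text.split():
        terms.extend(char_ngrams(word, n))
    return terms


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming edit distance, computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# A single misrecognized character ruins the whole word as an index term,
# but most of its character 3-grams still match the clean form.
clean = set(char_ngrams("retrieval"))
noisy = set(char_ngrams("retrieva1"))
shared = clean & noisy
```

In this toy case the word-level term fails to match at all, while six of the seven 3-grams survive the error. This is one intuition for why short character n-grams can remain usable for retrieval when only moderate error reduction is available.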

Original language: English
Pages (from-to): 405-425
Number of pages: 21
Journal: Information Retrieval
Volume: 11
Issue number: 5
DOIs: 10.1007/s10791-008-9055-y
Publication status: Published - 1 Oct 2008
Externally published: Yes

Fingerprint

Optical character recognition
Error correction
language
ability
Degradation

Keywords

  • Error correction
  • Information retrieval
  • Language modeling
  • OCR

ASJC Scopus subject areas

  • Information Systems

Cite this

Effect of OCR error correction on Arabic retrieval. / Magdy, Walid; Darwish, Kareem.

In: Information Retrieval, Vol. 11, No. 5, 01.10.2008, p. 405-425.

@article{300e2d94d9ba4a12a62d0cdf8f679962,
title = "Effect of OCR error correction on Arabic retrieval",
abstract = "Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.",
keywords = "Error correction, Information retrieval, Language modeling, OCR",
author = "Walid Magdy and Kareem Darwish",
year = "2008",
month = "10",
day = "1",
doi = "10.1007/s10791-008-9055-y",
language = "English",
volume = "11",
pages = "405--425",
journal = "Information Retrieval",
issn = "1386-4564",
publisher = "Springer Netherlands",
number = "5",
}

TY - JOUR

T1 - Effect of OCR error correction on Arabic retrieval

AU - Magdy, Walid

AU - Darwish, Kareem

PY - 2008/10/1

Y1 - 2008/10/1

AB - Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.

KW - Error correction

KW - Information retrieval

KW - Language modeling

KW - OCR

UR - http://www.scopus.com/inward/record.url?scp=50849095136&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=50849095136&partnerID=8YFLogxK

U2 - 10.1007/s10791-008-9055-y

DO - 10.1007/s10791-008-9055-y

M3 - Article

AN - SCOPUS:50849095136

VL - 11

SP - 405

EP - 425

JO - Information Retrieval

JF - Information Retrieval

SN - 1386-4564

IS - 5

ER -