Error correction vs. query garbling for Arabic OCR document retrieval

Kareem Darwish, Walid Magdy

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

Due to the existence of large numbers of legacy documents (such as old books and newspapers), improving retrieval effectiveness for OCR'ed documents continues to be an important problem. This article compares the effect of OCR error correction with and without language modeling and the effect of query garbling with weighted structured queries on the retrieval of OCR degraded Arabic documents. The results suggest that moderate error correction does not yield statistically significant improvement in retrieval effectiveness when indexing and searching using n-grams. Also, reversing error correction models to perform query garbling in conjunction with weighted structured queries yields improved retrieval effectiveness. Lastly, using very good error correction that utilizes language modeling yields the best improvement in retrieval effectiveness.

Original language: English
Article number: 5
Journal: ACM Transactions on Information Systems
Volume: 26
Issue number: 1
DOI: 10.1145/1292591.1292596
Publication status: Published - 1 Nov 2007
Externally published: Yes

Fingerprint

Optical character recognition
Error correction
Query

Keywords

  • Arabic retrieval
  • OCR correction
  • OCR retrieval

ASJC Scopus subject areas

  • Information Systems

Cite this

Error correction vs. query garbling for Arabic OCR document retrieval. / Darwish, Kareem; Magdy, Walid.

In: ACM Transactions on Information Systems, Vol. 26, No. 1, 5, 01.11.2007.

@article{8e5774eefe3a4b1699485f477a3a0d0c,
title = "Error correction vs. query garbling for Arabic OCR document retrieval",
abstract = "Due to the existence of large numbers of legacy documents (such as old books and newspapers), improving retrieval effectiveness for OCR'ed documents continues to be an important problem. This article compares the effect of OCR error correction with and without language modeling and the effect of query garbling with weighted structured queries on the retrieval of OCR degraded Arabic documents. The results suggest that moderate error correction does not yield statistically significant improvement in retrieval effectiveness when indexing and searching using n-grams. Also, reversing error correction models to perform query garbling in conjunction with weighted structured queries yields improved retrieval effectiveness. Lastly, using very good error correction that utilizes language modeling yields the best improvement in retrieval effectiveness.",
keywords = "Arabic retrieval, OCR correction, OCR retrieval",
author = "Kareem Darwish and Walid Magdy",
year = "2007",
month = "11",
day = "1",
doi = "10.1145/1292591.1292596",
language = "English",
volume = "26",
journal = "ACM Transactions on Information Systems",
issn = "1046-8188",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - Error correction vs. query garbling for Arabic OCR document retrieval

AU - Darwish, Kareem

AU - Magdy, Walid

PY - 2007/11/1

Y1 - 2007/11/1

AB - Due to the existence of large numbers of legacy documents (such as old books and newspapers), improving retrieval effectiveness for OCR'ed documents continues to be an important problem. This article compares the effect of OCR error correction with and without language modeling and the effect of query garbling with weighted structured queries on the retrieval of OCR degraded Arabic documents. The results suggest that moderate error correction does not yield statistically significant improvement in retrieval effectiveness when indexing and searching using n-grams. Also, reversing error correction models to perform query garbling in conjunction with weighted structured queries yields improved retrieval effectiveness. Lastly, using very good error correction that utilizes language modeling yields the best improvement in retrieval effectiveness.

KW - Arabic retrieval

KW - OCR correction

KW - OCR retrieval

UR - http://www.scopus.com/inward/record.url?scp=37049013904&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=37049013904&partnerID=8YFLogxK

U2 - 10.1145/1292591.1292596

DO - 10.1145/1292591.1292596

M3 - Article

VL - 26

JO - ACM Transactions on Information Systems

JF - ACM Transactions on Information Systems

SN - 1046-8188

IS - 1

M1 - 5

ER -