Word-based correction for retrieval of Arabic OCR degraded documents

Walid Magdy, Kareem Darwish

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
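The two ideas the abstract contrasts can be illustrated with a small sketch. It is not the authors' implementation: the function names, the Latin-script toy word, the toy lexicon counts, and the single-character substitution channel (with a made-up `p_char_ok` probability) are all assumptions chosen for brevity, whereas the paper describes a character-segment-based channel model for Arabic. The sketch only shows why one OCR error destroys a word-level match while leaving most short n-grams intact, and how a noisy channel corrector picks the lexicon word that maximizes P(word) × P(observed | word).

```python
from collections import Counter

def char_ngrams(word, n):
    """Return the character n-grams of a word (the whole word if it is no longer than n)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def shared_fraction(clean, degraded, n):
    """Fraction of the clean word's n-grams that still appear in the degraded word."""
    clean_grams = Counter(char_ngrams(clean, n))
    degraded_grams = Counter(char_ngrams(degraded, n))
    overlap = sum((clean_grams & degraded_grams).values())
    return overlap / sum(clean_grams.values())

def correct(observed, lexicon_counts, p_char_ok=0.9):
    """Toy noisy channel corrector (hypothetical, substitution-only).

    Scores each same-length lexicon word by P(word) * P(observed | word),
    where every character is assumed to be recognized correctly with
    probability p_char_ok, independently of its neighbours.
    """
    total = sum(lexicon_counts.values())
    best, best_score = observed, 0.0
    for word, count in lexicon_counts.items():
        if len(word) != len(observed):
            continue  # this toy channel models substitutions only
        mismatches = sum(a != b for a, b in zip(word, observed))
        matches = len(word) - mismatches
        channel = (p_char_ok ** matches) * ((1 - p_char_ok) ** mismatches)
        score = (count / total) * channel
        if score > best_score:
            best, best_score = word, score
    return best

if __name__ == "__main__":
    clean, degraded = "retrieval", "retrieva1"  # one simulated OCR error
    print(shared_fraction(clean, degraded, 3))           # trigram index terms
    print(shared_fraction(clean, degraded, len(clean)))  # whole-word index term
    toy_lexicon = {"retrieval": 5, "retrieved": 3, "document": 10}
    print(correct(degraded, toy_lexicon))
```

In this toy case six of the seven trigrams of the clean word survive the error, while the whole-word term no longer matches at all, which is the intuition behind the abstract's remark that short n-grams may make word-level correction unnecessary.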

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 205-216
Number of pages: 12
Volume: 4209 LNCS
Publication status: Published - 31 Oct 2006
Externally published: Yes
Event: 13th International Conference on String Processing and Information Retrieval, SPIRE 2006 - Glasgow, United Kingdom
Duration: 11 Oct 2006 → 13 Oct 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4209 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th International Conference on String Processing and Information Retrieval, SPIRE 2006
Country: United Kingdom
City: Glasgow
Period: 11/10/06 → 13/10/06

Fingerprint

Optical character recognition
Retrieval
Language
N-gram
Channel Model
Error Correction
Term
Indexing
Degradation
Character

Keywords

  • Error correction
  • OCR
  • Retrieval

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Magdy, W., & Darwish, K. (2006). Word-based correction for retrieval of Arabic OCR degraded documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4209 LNCS, pp. 205-216). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4209 LNCS).

@inproceedings{84618da8adf44d26b066bc9f117bb706,
title = "Word-based correction for retrieval of Arabic OCR degraded documents",
abstract = "Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.",
keywords = "Error correction, OCR, Retrieval",
author = "Walid Magdy and Kareem Darwish",
year = "2006",
month = "10",
day = "31",
language = "English",
isbn = "3540457747",
volume = "4209 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "205--216",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY  - GEN
T1  - Word-based correction for retrieval of Arabic OCR degraded documents
AU  - Magdy, Walid
AU  - Darwish, Kareem
PY  - 2006/10/31
Y1  - 2006/10/31
AB  - Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
KW  - Error correction
KW  - OCR
KW  - Retrieval
UR  - http://www.scopus.com/inward/record.url?scp=33750374299&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=33750374299&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:33750374299
SN  - 3540457747
SN  - 9783540457749
VL  - 4209 LNCS
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 205
EP  - 216
BT  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER  -