Term selection for searching printed Arabic

Kareem Darwish, Douglas W. Oard

Research output: Contribution to journal (Article)

39 Citations (Scopus)

Abstract

Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
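To make the two kinds of indexing terms in the abstract concrete, here is a minimal sketch of generating overlapping character n-grams for each word and indexing them alongside the word itself. It is an illustration under assumptions, not the authors' implementation: the function names `char_ngrams` and `index_terms`, the whitespace tokenization, and the n-gram sizes of 3 and 4 are hypothetical choices, and the light stemming step described in the abstract is omitted, so words are used unstemmed.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a single word.

    Words shorter than n are returned whole so they still produce a term.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not word:
        return []
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, sizes: tuple[int, ...] = (3, 4)) -> list[str]:
    """Combine whole words with their character n-grams as indexing terms.

    Assumptions: whitespace tokenization and no stemming. A real system
    would normalize the text and lightly stem each word first.
    """
    terms: list[str] = []
    for word in text.split():
        terms.append(word)
        for n in sizes:
            # Skip n-grams identical to the word to avoid duplicate terms.
            terms.extend(g for g in char_ngrams(word, n) if g != word)
    return terms


if __name__ == "__main__":
    # A four-letter Arabic word yields two trigrams in addition to itself.
    for term in index_terms("كتاب", sizes=(3,)):
        print(term)
```

One motivation for n-grams in this setting is that a single OCR character error corrupts only the n-grams overlapping that character, so the remaining n-grams of the word can still match a query.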

Original language: English
Pages (from-to): 261-268
Number of pages: 8
Journal: SIGIR Forum (ACM Special Interest Group on Information Retrieval)
Publication status: Published - 1 Dec 2002
Externally published: Yes

Fingerprint

Optical character recognition
Automatic indexing
Indexing

Keywords

  • Arabic
  • Information retrieval
  • OCR
  • Term selection

ASJC Scopus subject areas

  • Management Information Systems
  • Hardware and Architecture

Cite this

Term selection for searching printed Arabic. / Darwish, Kareem; Oard, Douglas W.

In: SIGIR Forum (ACM Special Interest Group on Information Retrieval), 01.12.2002, p. 261-268.


@article{9452b4a45f714c30985827782ccc827c,
title = "Term selection for searching printed Arabic",
abstract = "Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.",
keywords = "Arabic, Information retrieval, OCR, Term selection",
author = "Darwish, Kareem and Oard, {Douglas W.}",
year = "2002",
month = "12",
day = "1",
language = "English",
pages = "261--268",
journal = "SIGIR Forum (ACM Special Interest Group on Information Retrieval)",
issn = "0163-5840",
publisher = "Association for Computing Machinery (ACM)",

}

TY - JOUR

T1 - Term selection for searching printed Arabic

AU - Darwish, Kareem

AU - Oard, Douglas W.

PY - 2002/12/1

Y1 - 2002/12/1

N2 - Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.


KW - Arabic

KW - Information retrieval

KW - OCR

KW - Term selection

UR - http://www.scopus.com/inward/record.url?scp=0036993294&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036993294&partnerID=8YFLogxK

M3 - Article

SP - 261

EP - 268

JO - SIGIR Forum (ACM Special Interest Group on Information Retrieval)

JF - SIGIR Forum (ACM Special Interest Group on Information Retrieval)

SN - 0163-5840

ER -