QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Felix Stahlberg, Stephan Vogel

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
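The character error rate (CER) quoted in the abstract is commonly defined as the character-level edit distance between the recognized text and the reference transcription, divided by the length of the reference. The following is a minimal sketch of that common definition, not the authors' evaluation code: the function names `edit_distance` and `character_error_rate` are hypothetical, and the paper may apply text normalization or ligature handling that is not reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    # prev[j] holds the distance between the first i-1 chars of ref and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length (assumed definition, see lead-in)."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a hypothesis that gets one character wrong in a four-character reference has a CER of 0.25, i.e. 25%. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.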

Original language: English
Title of host publication: Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 168-173
Number of pages: 6
ISBN (Electronic): 9781509017928
DOI: https://doi.org/10.1109/DAS.2016.81
Publication status: Published - 10 Jun 2016
Event: 12th IAPR International Workshop on Document Analysis Systems, DAS 2016 - Santorini, Greece
Duration: 11 Apr 2016 to 14 Apr 2016

Other

Other: 12th IAPR International Workshop on Document Analysis Systems, DAS 2016
Country: Greece
City: Santorini
Period: 11/4/16 to 14/4/16

Fingerprint

Optical character recognition
normalization
user interface
Graphical user interfaces
Application programming interfaces (API)
monitoring
Interfaces (computer)
language

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Library and Information Sciences

Cite this

Stahlberg, F., & Vogel, S. (2016). QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. In Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016 (pp. 168-173). [7490112] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DAS.2016.81

QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. / Stahlberg, Felix; Vogel, Stephan.

Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016. Institute of Electrical and Electronics Engineers Inc., 2016. p. 168-173 7490112.

Stahlberg, F & Vogel, S 2016, QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. in Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016, 7490112, Institute of Electrical and Electronics Engineers Inc., pp. 168-173, 12th IAPR International Workshop on Document Analysis Systems, DAS 2016, Santorini, Greece, 11/4/16. https://doi.org/10.1109/DAS.2016.81
Stahlberg F, Vogel S. QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. In Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016. Institute of Electrical and Electronics Engineers Inc. 2016. p. 168-173. 7490112 https://doi.org/10.1109/DAS.2016.81
Stahlberg, Felix ; Vogel, Stephan. / QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016. Institute of Electrical and Electronics Engineers Inc., 2016. pp. 168-173
@inproceedings{d951edb0854743809e587d03a2eef0e4,
title = "QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries",
abstract = "Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6{\%} character error rate with QATIP compared to 51.8{\%} with the best OCR product in our experimental setup (Tesseract).",
author = "Felix Stahlberg and Stephan Vogel",
year = "2016",
month = "6",
day = "10",
doi = "10.1109/DAS.2016.81",
language = "English",
pages = "168--173",
booktitle = "Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - GEN

T1 - QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries

AU - Stahlberg, Felix

AU - Vogel, Stephan

PY - 2016/6/10

Y1 - 2016/6/10

AB - Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).

UR - http://www.scopus.com/inward/record.url?scp=84979529791&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84979529791&partnerID=8YFLogxK

U2 - 10.1109/DAS.2016.81

DO - 10.1109/DAS.2016.81

M3 - Conference contribution

SP - 168

EP - 173

BT - Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016

PB - Institute of Electrical and Electronics Engineers Inc.

ER -