QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Felix Stahlberg, Stephan Vogel

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

1 Citation (Scopus)

Abstract

Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
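
The character error rate (CER) figures quoted above are conventionally computed as the edit distance between the recognized text and a reference transcription, divided by the reference length. The abstract does not say how QATIP's evaluation was implemented, so the following is only a minimal sketch of the usual definition. The function names are hypothetical, and it assumes that one Unicode code point counts as one character, which may differ from the paper's treatment of Arabic ligatures and diacritics.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed definition of CER)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# One missing letter in a four-letter reference word gives a CER of 0.25.
print(character_error_rate("كتاب", "كتب"))
```

Under this definition, a CER of 12.6% corresponds to roughly one character edit per eight reference characters, against about one edit per two characters at 51.8%.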

Original language: English
Title of host publication: Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 168-173
Number of pages: 6
ISBN (Electronic): 9781509017928
DOIs: https://doi.org/10.1109/DAS.2016.81
Publication status: Published - 10 Jun 2016
Event: 12th IAPR International Workshop on Document Analysis Systems, DAS 2016 - Santorini, Greece
Duration: 11 Apr 2016 - 14 Apr 2016

Other

Other: 12th IAPR International Workshop on Document Analysis Systems, DAS 2016
Country: Greece
City: Santorini
Period: 11/4/16 - 14/4/16

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Library and Information Sciences

Cite this

Stahlberg, F., & Vogel, S. (2016). QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. In Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016 (pp. 168-173). [7490112] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DAS.2016.81