Abstract
Commercial optical character recognition (OCR) software now achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of the Arabic heritage collections in libraries is more challenging, consisting of, e.g., typewritten documents, early prints, and historical manuscripts. In this paper, we present QATIP, our end-user-oriented system for OCR on such documents. Recognition is based on the Kaldi toolkit and sophisticated text image normalization. The paper makes two main contributions. First, we describe the QATIP interface for libraries, which consists of a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we propose novel approaches to language modelling and ligature modelling for continuous Arabic OCR. We test QATIP on an early print and a historical manuscript and report substantial improvements, e.g. a 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
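For readers unfamiliar with the metric, character error rate (CER) is commonly computed as the Levenshtein edit distance between the recognized text and the reference transcription, divided by the reference length. The sketch below follows that common definition; the abstract does not specify the paper's exact scoring procedure (e.g. text normalization or handling of diacritics), so the function names and the per-code-point comparison are assumptions made for illustration only.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance counted in Unicode code points."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# One missing letter in a four-letter reference word.
cer_example = character_error_rate("كتاب", "كتب")

# Relative reduction implied by the two figures quoted in the abstract.
reduction = relative_reduction(51.8, 12.6)
```

Under this definition, the figures quoted above (51.8% versus 12.6%) correspond to a relative error reduction of roughly three quarters.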
Original language | English |
---|---|
Title of host publication | Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 168-173 |
Number of pages | 6 |
ISBN (Electronic) | 9781509017928 |
DOIs | 10.1109/DAS.2016.81 |
Publication status | Published - 10 Jun 2016 |
Event | 12th IAPR International Workshop on Document Analysis Systems, DAS 2016 - Santorini, Greece Duration: 11 Apr 2016 → 14 Apr 2016 |
ASJC Scopus subject areas
- Computer Networks and Communications
- Computer Vision and Pattern Recognition
- Library and Information Sciences
Cite this
QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. / Stahlberg, Felix; Vogel, Stephan.
Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016. Institute of Electrical and Electronics Engineers Inc., 2016. p. 168-173 7490112.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries
AU - Stahlberg, Felix
AU - Vogel, Stephan
PY - 2016/6/10
Y1 - 2016/6/10
UR - http://www.scopus.com/inward/record.url?scp=84979529791&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84979529791&partnerID=8YFLogxK
U2 - 10.1109/DAS.2016.81
DO - 10.1109/DAS.2016.81
M3 - Conference contribution
AN - SCOPUS:84979529791
SP - 168
EP - 173
BT - Proceedings - 12th IAPR International Workshop on Document Analysis Systems, DAS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
ER -