A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition

Bilal Bataineh

Abstract


Document image analysis and recognition are important topics in artificial intelligence, and the machine-learning methods they rely on require databases with good script samples. Many suitable databases exist for Latin and Asian languages, but databases with Arabic samples are scarce. This work introduces a new database of printed Arabic text. Instead of whole words or individual characters, it collects sub-words, known as pieces of Arabic words (PAWs), which together make up every word in the Arabic language. The database contains 83,056 PAW samples extracted from approximately 550,000 different words. Each sample appears in five font types: Thuluth, Naskh, Andalusi, Typing Machine, and Kufi, giving 415,280 images in total. Ground truth information accompanies each PAW image and describes its occurrence number, its occurrence frequency, and the positions and shapes of its characters. The paper also presents a statistical analysis of the frequency of each PAW in the Arabic language.
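To make the PAW concept and the occurrence statistics concrete, the following is a minimal sketch and not the procedure used to build the database. It assumes undiacritized input, assumes that a PAW ends after any letter that does not join to the following letter (the set `NON_JOINING` below), and treats the standalone hamza as a PAW of its own. The names `split_paws` and `paw_statistics` and the three-word corpus are hypothetical.

```python
from collections import Counter

# Assumption: letters that never join to the following letter.
NON_JOINING = set("اأإآدذرزوؤ")
HAMZA = "ء"  # assumption: standalone hamza joins to neither neighbour


def split_paws(word):
    """Split one undiacritized Arabic word into its PAWs."""
    paws, current = [], ""
    for ch in word:
        if ch == HAMZA and current:
            # Close the running PAW before an isolated hamza.
            paws.append(current)
            current = ""
        current += ch
        if ch in NON_JOINING or ch == HAMZA:
            # The connected piece ends after a non-joining letter.
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


def paw_statistics(words):
    """Return {paw: (occurrence count, relative frequency)}."""
    counts = Counter(p for w in words for p in split_paws(w))
    total = sum(counts.values())
    return {p: (n, n / total) for p, n in counts.items()}


# Hypothetical three-word corpus, for illustration only.
stats = paw_statistics(["كتاب", "درس", "باب"])
```

In this sketch the occurrence count is the number of times a PAW appears across the word list, and the frequency is that count divided by the total number of PAW occurrences. The definitions used for the database's ground truth may differ.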

Keywords


Arabic language; database; document images; information retrieval; OCR; PAWs.



DOI: http://dx.doi.org/10.5614%2Fitbj.ict.res.appl.2017.11.2.6

