A Word & Character N-Gram based Arabic OCR Error Simulation model

Authors

  • Mostafa Ezzat Institute of Statistical Studies & Research, Cairo University, EGYPT
  • Tarek Ahmed ElGhazaly Institute of Statistical Studies & Research, Cairo University, EGYPT
  • Mervat Gheith Institute of Statistical Studies & Research, Cairo University, EGYPT

DOI:

https://doi.org/10.24297/ijct.v12i8.2999

Keywords:

Arabic OCR-Degraded Text Retrieval, Arabic OCR-Degraded Text, Orthographic Query Expansion, Synthesized OCR-Degraded Text.

Abstract

This paper presents a new model that aims to improve the retrieval effectiveness of Arabic OCR-degraded text. The proposed model simulates Arabic OCR recognition errors using both word-based and character N-gram approaches. The user's search query is then expanded with the expected OCR errors. The expanded query yields higher precision and recall when searching Arabic OCR-degraded text than the original query does. The proposed model shows a significant increase in degraded-text retrieval effectiveness over previous models: the retrieval effectiveness of the new model is 93%, while the best published effectiveness was 84% for a word-based approach and 56% for a character-based approach.
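
The query expansion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the confusion tables `WORD_CONFUSIONS` and `CHAR_CONFUSIONS`, the function names, and the `max_variants` cap are all hypothetical, and the character level is simplified to single-character (unigram) substitutions rather than a trained N-gram model.

```python
from itertools import product

# Hypothetical word-level table: query word -> OCR outputs assumed to be likely.
# Illustrative entries only; not taken from the paper.
WORD_CONFUSIONS = {
    "في": ["فى"],
}

# Hypothetical character-level table (unigrams only, for brevity).
# Illustrative entries only; not taken from the paper.
CHAR_CONFUSIONS = {
    "ة": ["ه"],
    "ي": ["ى"],
    "أ": ["ا"],
}


def expand_word(word, max_variants=50):
    """Return the word followed by its simulated OCR-error variants."""
    # Start with the original word and any word-level variants.
    variants = [word] + WORD_CONFUSIONS.get(word, [])
    # For each character, allow the original or any of its confusions.
    options = [[ch] + CHAR_CONFUSIONS.get(ch, []) for ch in word]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate not in variants:
            variants.append(candidate)
        if len(variants) >= max_variants:
            break
    return variants


def expand_query(query):
    """Map each whitespace-separated query term to its list of variants."""
    return {term: expand_word(term) for term in query.split()}
```

Under these assumptions, a retrieval system would search for any of the variants of each term (a logical OR per term) instead of only the original spelling. In practice the tables would be estimated from aligned clean and OCR-degraded text and would carry probabilities for ranking or pruning variants.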

Author Biographies

Mostafa Ezzat, Institute of Statistical Studies & Research, Cairo University, EGYPT

Computer Sciences Department

Tarek Ahmed ElGhazaly, Institute of Statistical Studies & Research, Cairo University, EGYPT

Computer Sciences Department

Mervat Gheith, Institute of Statistical Studies & Research, Cairo University, EGYPT

Computer Sciences Department

Published

2014-02-22

How to Cite

Ezzat, M., ElGhazaly, T. A., & Gheith, M. (2014). A Word & Character N-Gram based Arabic OCR Error Simulation model. INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY, 12(8), 3758–3767. https://doi.org/10.24297/ijct.v12i8.2999

Section

Research Articles