Segmentation of Handwritten Text Document Written in Devanagri Script for Simple character, skewed character and broken character

OCR (optical character recognition) is a technology commonly used for pattern recognition in artificial intelligence and computer vision. With the help of OCR, scanned documents can be converted into editable documents, which can then be used in various research areas. In this paper, we present a character segmentation technique that can segment simple characters, skewed characters, and broken characters. Character segmentation is a very important phase of any OCR process because its output serves as the input to later phases such as character recognition. If character segmentation goes wrong, recognizing the corresponding character becomes very difficult or nearly impossible.


Introduction
OCR is a technology that enables us to convert different types of scanned documents into editable documents. It is a part of an electronic document analysis system and is used to extract text from scanned images of typewritten, handwritten, or printed text. The process of OCR can be described as follows:

Character Segmentation
Character segmentation is the procedure that extracts individual characters from the words produced by word segmentation. It is a critical step of an OCR system: it seeks to decompose an image of a sequence of characters into sub-images of individual symbols. It depends on the script in which the document is written. A poor segmentation process produces misrecognition or rejection. The segmentation process is carried out only after the preprocessing of the image.

Broken Character
A character can be broken because of the writer's pen or the quality of the page used. The following figure shows a broken character.
June 15, 2013

Fig 4: Broken Character in a Word
Segmenting a broken character is quite difficult because the vertical profile projection technique treats the broken parts of a character as individual characters.
The following diagram shows a word that was wrongly segmented because of a broken character:

Fig 5: Wrongly Segmented Word with Broken Character
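The failure described above can be reproduced on a toy image. The following is a minimal sketch, not the authors' implementation: the 3x10 binary word (1 = black pixel) and the helper names `column_counts` and `naive_segments` are hypothetical and chosen only for illustration. The first character has a one-column break at column 2, and a plain vertical projection cuts it there.

```python
def column_counts(word):
    """Vertical projection: number of black pixels (1s) in each column."""
    return [sum(col) for col in zip(*word)]


def naive_segments(word):
    """Cut at every all-white column; returns (start, end) spans, end exclusive."""
    spans, start = [], None
    for i, count in enumerate(column_counts(word)):
        if count > 0 and start is None:
            start = i                  # a run of inked columns begins
        elif count == 0 and start is not None:
            spans.append((start, i))   # a white column closes the run
            start = None
    if start is not None:              # run that reaches the right border
        spans.append((start, len(word[0])))
    return spans


# Hypothetical word: a broken character (columns 0-4, break at column 2),
# a three-column gap (columns 5-7), and a second character (columns 8-9).
word = [
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
]

print(column_counts(word))   # [3, 2, 0, 2, 3, 0, 0, 0, 3, 2]
print(naive_segments(word))  # [(0, 2), (3, 5), (8, 10)]
```

The word contains two characters, but three pieces are reported because the break at column 2 is indistinguishable from a gap when each column is examined in isolation.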
Solution: A broken character can be handled by scanning the neighboring pixels before segmenting the word into characters. The neighboring pixels on both the left and the right side are checked. If black pixels are found there, the character is broken and the word is not segmented at that point; if the neighbors are white, these pixels are treated as a gap and the word is segmented there.
The following figure shows a broken character correctly segmented using the above solution.
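The neighbor check can be sketched as follows. This is an illustrative reading of the rule, not the authors' code. It assumes that "neighboring pixels" means the whole columns immediately to the left and right of a white column. The image and the names `is_white` and `cut_columns` are hypothetical, and border columns are skipped because they lack one neighbor.

```python
def is_white(word, i):
    """True when column i contains no black pixel (1)."""
    return all(row[i] == 0 for row in word)


def cut_columns(word):
    """Columns where a cut is allowed: column i and both neighbours are white."""
    width = len(word[0])
    return [
        i for i in range(1, width - 1)   # border columns have only one neighbour
        if is_white(word, i - 1) and is_white(word, i) and is_white(word, i + 1)
    ]


# Hypothetical word: a broken character (columns 0-4, break at column 2),
# a three-column gap (columns 5-7), and a second character (columns 8-9).
word = [
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1],
]

print(is_white(word, 2))   # True: the break is an all-white column
print(cut_columns(word))   # [6]: only the real gap qualifies
```

Column 2 is white but has black columns on both sides, so it is treated as a break inside one character. Under this reading, a gap narrower than three columns is never cut and a break three or more columns wide is still cut, so the rule implicitly assumes that breaks are narrower than inter-character gaps.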

Skewed Character
Characters in a word may slant either upward or downward, which results in skewed characters. Skewed characters result from the writing skills of the person. The following figure shows a skewed character.

Fig 7: A Skewed Word
While segmenting a skewed character, the problem arises in detecting the header line of the word, which results in an improperly segmented word.
The following diagram shows an improperly segmented word due to a skewed character.
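One plausible reason why skew makes the header line hard to detect is that a slanted header stroke is spread over several rows, so no single row stands out in the horizontal projection. This explanation is an assumption offered for illustration, not an analysis taken from the paper. The sketch below draws only a header stroke. The helper names `row_counts` and `header_stroke` and the image sizes are hypothetical.

```python
def row_counts(image):
    """Horizontal projection: number of black pixels (1s) in each row."""
    return [sum(row) for row in image]


def header_stroke(width, height, drop_every=None):
    """Draw a header stroke: flat on row 0, or dropping one row every
    `drop_every` columns to imitate a downward slant."""
    image = [[0] * width for _ in range(height)]
    for x in range(width):
        y = 0 if drop_every is None else x // drop_every
        image[y][x] = 1
    return image


flat = header_stroke(12, 3)
skewed = header_stroke(12, 3, drop_every=4)

print(row_counts(flat))    # [12, 0, 0]: one clearly dominant row
print(row_counts(skewed))  # [4, 4, 4]: the stroke is spread over three rows
```

In the flat case the header row is unambiguous. In the skewed case every row has the same count, so choosing the row with the most black pixels no longer identifies the header line reliably.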

Our Approach
Our algorithm for segmenting characters, which may be skewed or broken, has the following steps:
Step 1: Scan the document into an image from which words are to be segmented into characters.
Step 2: Binarize the scanned image.
Step 3: Remove the noise from the binarized document.
Step 4: Extract the line from which we want to segment the words.
Step 5: Calculate the frequency of black pixels in each row, along with its neighbors, using the horizontal profile projection technique.
Step 6: Find the row with the highest number of black pixels and treat that row as the header row.
Step 7: Remove that header row from the word for segmentation.
Step 8: Using the vertical profile projection technique, parse the word column-wise.
Step 9: For each i-th column of the word, check whether all of its pixels are white. If so, check the neighboring columns i-1 and i+1. If all three are white, treat them as a gap between two characters and segment the word there. But if either neighbor (i-1 or i+1) is black, it represents a broken character, so do not segment the word at the i-th column.
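Steps 5 to 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the word is assumed to be already binarized (1 = black pixel), the helper names are hypothetical, and the "along with its neighbors" detail of Step 5 is not reproduced because the text does not specify how neighboring rows are combined. The sketch uses the plain per-row count.

```python
def row_counts(word):
    """Step 5 (simplified): black pixels per row, without the neighbour term."""
    return [sum(row) for row in word]


def header_row(word):
    """Step 6: index of the row with the most black pixels."""
    counts = row_counts(word)
    return counts.index(max(counts))


def remove_header(word):
    """Step 7: return a copy of the word with the header row blanked."""
    r = header_row(word)
    return [[0] * len(row) if i == r else list(row) for i, row in enumerate(word)]


def column_counts(word):
    """Vertical projection used in Step 8: black pixels per column."""
    return [sum(col) for col in zip(*word)]


# Hypothetical word: two characters joined by a header line on row 0.
word = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
]

print(header_row(word))                    # 0
print(column_counts(word))                 # [2, 4, 1, 1, 1, 4, 2]
print(column_counts(remove_header(word)))  # [1, 3, 0, 0, 0, 3, 1]
```

Before the header row is removed, every column contains at least one black pixel, so the vertical projection finds no gap. After removal, columns 2 to 4 become white and the column-wise check of Step 9 can be applied.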

Experiments and results
We have tested our algorithm on 30 documents in different handwriting. Our system shows an accuracy of 96%. The results after applying the algorithm are as follows:
Result Fig. 1
Result Fig. 2

Conclusion
From result Table I, we can say that the new system gives very good results. On simple words the system shows an accuracy of 100%, on skewed words 95%, and on broken characters 96%. The system can be extended to segment words with overlapped and/or touching characters.