A Multi-Stage Hybrid Model for Odia Compound Character Recognition

Dibyasundar Das, Deepak Ranjan Nayak, Ratnakar Dash, and Banshidhar Majhi


Optical character recognition (OCR) for Indie languages has recently gained increasing popularity due to its high usage in many real-time applications. Most of the earlier works have shown higher recognition performance for languages like English, Japanese, and Chinese. However, the performance for Indie languages is still far away from real-time demands. Odia is one of the constitutional languages of India, specifically used in the states of Odisha, Jharkhand, and West Bengal. There are about 27 million people who use Odia as their basic language of communication. The major challenge in modeling an Odia OCR is the detection of compound characters with higher accuracy. Odia script contains more than 400 character classes that include vowels, consonants, matras (vowel allographs), compound characters

(consonant allographs), some special symbols and numerals. The proposed method divides the character recognition task into 211 classes of characters and detects the class label using a three-stage hybrid architecture. It is worth mentioning here that the method is applied only to Odia printed characters.

OCR development has a rich history [8, 30, 34, 37]. Starting from OCR machine to modern digital software-based recognition systems, the main aim is to recognize and classify characters irrespective of position, font size, and their orientation in the visual field. Now with modern devices, making the detection font style independent has also become an interesting field of research. Broadly an OCR system has two steps, namely, character extraction and character detection. The extraction process involves image enhancement [1], skew and error correction, binarization [18, 19, 42^14], line and word separation, and character separation. The image enhancement and error removal techniques aim to make the image ready for the OCR engine. These steps are generally referred to as the pre-processing step as a whole, which can be followed by a font detection process [31]. Font detection helps to improve accuracy by utilizing font-specific detection algorithms. So far little work has been reported in relation to printed Odia character recognition, and in most of these cases only primary characters [7, 29] have been considered. The study of basic and compound characters in many of the Indian languages like Bangla [6, 11] and Devnagari [34] has been carried out by many researchers. Chaudhuri et al. [6] have studied 75 basic and modified characters of the Bangla language and achieved 96% accuracy. Following this, 300 numbers of allographs of the Bangla script w'ere studied by Garain et al. [17] using the run-number-based hybrid method and showed 99.69% of accuracy over a self-collected dataset, while in [34], an accuracy of 96.00% was earned for Devanagari character recognition. However, the performance is yet to be improved significantly for other Indian scripts like Kannada [3], Gurumukhi [21]. Telugu [46], Tamil [9], Gujarati [2. 16], and Odia [7. 29].

Many studies have been carried out for handwritten Odia OCR. Padhi et al. [32, 33] designed features based on zone segmentation and its statistical information, and used ANN to classify 49 allographs of Odia script, with an accuracy of 94% achieved on a self-collected handwritten character dataset. Pal et al. [36] studied 52 Odia characters using gradient and curvature features with a quadratic classifier to achieve an accuracy of 91.11%. Kumar et al. [20] proposed a model using an Ant miner algorithm on 50 classes of Odia handwritten characters and achieved 90% accuracy. Basa et al. [4, 25] designed a two-stage classification system. They proposed a tree based method to classify Odia handwritten characters into two groups and further used discriminant features of each group learned by two different ANNs. They formed a handwritten Odia character dataset of 51 classes and showrd that their method provided an overall accuracy of 90%. Das et al. [10] proposed a recognition model for Odia characters using an extreme learning machine (ELM). They analyzed various parameters of ELM and their effect on accuracy for various datasets.

Similarly, many works have been published on handwritten Odia numerals which are summarized as follows. One of the earliest works on handwritten Odia numeral recognition is reported by Sarangi et al. [41 ] where a Hopfield neural network is used to achieve 95.4% accuracy. Following this, Sarangi et al. [39] proposed a recognition model by use of a lower and upper triangular matrix (LU) feature w ith a multi-layer perceptron network, which gives 85.3% accuracy. Bhowmik et al. [5] reported recognition of Odia numerals with a hidden Markov model and the occurrence of strokes in characters. They obtained an accuracy of 90.5% from the experiment. Roy et al. [38] studied the histogram block-based feature for handwritten numeral recognition and achieved an accuracy of 94.81%. Sarangi et al. [40] obtained 85.3% accuracy using LU factorization as a feature extractor and a Naive Bayes classifier. Dash et al. [12] used a curvature feature and Kirsh edge operator with two well-known classifiers such as a discriminative-learning quadratic-discriminant function and a modified quadratic-discriminant function to achieve an accuracy of 98.4% and 98.5% respectively. Mishra et al. [26] investigated DCT and DWT features independently to achieve 87.5% and 92% accuracy respectively. Dash et al. [13, 14] proposed an Odia numeral recognition system using a zone-based non-redundant Socketwell transform and a к-nearest neighbor (k-NN) classifier to obtain an accuracy of 98.80%. Mahto et al. [24] designed a method where a simple ANN with a Quadrant-mean-based feature could achieve a 93.2% accuracy. Mishra et al. [28] used a contour-based feature with HMM to yield 96.3% accuracy on an Odia numeral dataset. Mishra et al. [27] employed using cord length and an angle feature for a handwritten Odia numeral recognition model. Dash et al. [15] used binary external- symmetry-axis-based features and classified a self-collected handwritten numeral dataset with an accuracy of 95%.

From this literature survey, it is evident that most of the research is limited to the detection of only 51 basic Odia characters and ten numeric characters. However, Odia has more than 400 classes of characters which are generated by combining one or more of them. Some of the sample characters and their category are given in Figure 3.1. Further. Most of the aforementioned works are dedicated in detecting 11 vowels.

35 consonants, and 10 numeric values. However, the synthesis of most Odia sentences is not possible without the use of martas and compound characters. Odia OCR needs to detect basic as well as compound characters to convert Odia scanned documents to their corresponding text. We studied the various literature and found that 211 classes of characters are ample for recognizing most of the documents. Our research focused on these 211 characters. The current models are not efficient enough to recognize all these characters, as shown in Section 3.5. Hence, aiming at improving recognition performance, we proposed a hybrid model which utilizes a structural similarity index (SSIM) in the first stage, a projection profile and Kendall rank correlation coefficient ranking in the second stage, and a local frequency descriptor (LFD) and General regression neural network (GRNN) in the final stage to predict the final class label. Different combinations of these methods were tested over a newly created dataset and a comparative analysis was made with state- of-the-art methods. An overall accuracy of 90.6% was achieved using the proposed three-stage hybrid model.

The chapter is organized in the following way. Section 3.2 gives a description of the methods that are required to build the hybrid model. Section 3.3 details the proposed model and Section 3.4 describes the dataset developed and the experimental details. Following this, in Section 3.5, results of this investigation are presented, and finally Section 3.6 concludes the chapter.

General OCR Stages

Generally, a complete OCR process is divided into multiple stages. Image acquisition is the first and essential step for any computer vision application. For character recognition images are captured by a flatbed scanner. A digital image of the document can also be obtained with powerful mobile cameras, though they need additional processing. After image acquisition, the known errors are to be removed by the pre-processing step. Image enhancement, noise removal, skew detection and correction, page segmentation, binarization. and font detection are examples of a few pre-processing methods that are often used in OCR models. Segmentation is the process of separating lines, words, and characters and has higher significance in determining the recognition accuracy in the case of handwritten and touched character recognition. Each separated character is fed into the recognition model to determine the class label. Models can be derived by various learning and classification methods and can be divided broadly into the two categories of template matching and feature matching hybrid models (feature + template). After recognizing the character class, its corresponding ASCII or UNI-code must be written into a text file. Handling both word and line separation is essential in this stage. The information obtained from the segmentation step is used in this context, though the recognition process may not lead to the desired result. The presence of punctuation and errors may also result in some unnecessary characters, which can be solved with postdictionary matching and contextual information. An overview of the general model is shown in Figure 3.2.

An overview of the OCR process

FIGURE 3.2 An overview of the OCR process.

Structural Similarity

Human eyes focus on the structure of an image rather than its illumination. The SSIM [47] exploits this principle and uses luminance, contrast, and structure comparison to give ranking value to image quality that is very similar to a subjective human evaluation standard. This is expressed mathematically in Equations 3.1 and 3.2. The overall presentation of the SSIM evaluation is shown in Figure 3.3.

where s denotes structure, c denotes contrast, and / indicates local luminance. The expression correlates to the human visual system and by taking a = /3 = у = 1 it can be simplified as

where Cj = (A", * L)2, C2 = (K2 * L)2, and L denotes the highest intensity level. The values of and K2 are computed by philosophical study and are set to 0.01 and 0.03 respectively.

SSIM evaluates images on a structural domain rather than an error domain. Hence, for each template image, the score with respect to the target image becomes different. It ranges from 0 to 1 and a higher value indicates better matching. It has been observed from experiment that the recognition performance of all 211 classes is very high for

Comparison of SSIM values of two template images

FIGURE 3.4 Comparison of SSIM values of two template images.

the font that is similar to the template. But in the case of other fonts, the actual class can be found within the best 10-15 scores. For a fair evaluation, the best 20 characters that have high SSIM values are chosen and these are used in the second stage to refine the results. Figure 3.4 is an example of a working of SSIM on separating similar characters. Figure 3.4(a) shows how the target character 21 gives different values for four alike templates that often creates confusion. An equivalent example for a compound character Я is given in Figure 3.4(b).

Projection Profile and Kendall Rank Correlation Coefficient Matching

The projection profile analysis refers to the matching of the top, bottom, left, and right projection of the black pixel count in the corresponding character image. The projection profile of a sample character is depicted in Figure 3.5. With the change in font, the location of the outer ovals and the depression changes but the order in which they appear remains constant. The analysis of the projection can hence be considered

a rank-order-matching problem. For this, the Kendall rank correlation coefficient (KRCC) is used and Equation 3.3 describes KRCC values. Here, the target vector is formed by collecting the top, left, bottom, and right projection sequentially. The image is normalized to 60 x 60, hence the vector size becomes 1 x 240.

where N is the number of concordant pairs and Ndp is the number of discordant pairs. Let {X,,ЛГ2,...,} be the observed/target vector and {F,. K2,.... F„) be the reference vector, then the joint random variable is {(X,, У,), (X,, K2), ... ,(X„, У„)}. A set of observations (X,, Yt) and (X,, Y) is said to be concordant if both X, > X, and Yt > Yf or X, < Xj and Y{ < Yj. Similarly a pair (Xd Yj) and (Xj, Yj) is discordant if Xi > X; and Y, < Yj or Xj < Xj and Y: > Yj. If X, = X( or Y( = Y} then the pair is neither concordant nor discordant. The KRCC value varies from -1 to 1, where -1 represents being completely out of agreement, 1 represents being exactly the same, and zero represents the two vectors being independent. For matching characters, the value closest to 1 is preferred.

Local Frequency Descriptor

A local edge descriptor [23] has been derived from LFD [22]. Here we consider neighbor pixels with radius R. Let LCF = {/j, ...,/P] be the set of neighborhood pixels that are located around the centroid. Hence LFD output can be expressed as:

where LFD(«) is a vector of 1 x P corresponding to the frequency component of the neighborhood pixels in LCF. Among P components, LDF(2) contains the most edge information. Hence the local frequency descriptor gradient (LFDG) is calculated by setting n = 2:

LFDG is a complex value; therefore to use it in a neural network Equation 3.5 can be decomposed into real and imaginary parts as:

Hence for each R the corresponding LFDG size is 1 x 2. For feature evaluation the

character image is normalized to 60 x 60. LFDG is evaluated for R = {1,2.....30)

around its centroid, which makes the feature size to be 1 x 60.

General Regression Neural Network (GRNN)

The GRNN [45] was developed by Specht in 1991 and the main advantage of this network is that it needs no training. For a given training set (дг„у(), the estimation of unknown dependent variable y,„, from independent variable x,es, is given by:

where a is the smoothing factor (here chosen as yfl ), p is dimension X, n is the size of the training set, A, is the distance of xles! to each respective xt. Here Euclidean distance is used for calculating A,, and is expressed as:

shows the overall model of a GRNN network, which can handle the dynamical nature of the previous stage prediction

Figure 3.6 shows the overall model of a GRNN network, which can handle the dynamical nature of the previous stage prediction. When multi-labels are predicted, the prediction is not always consistent, depending on the nature of the input. GRNN can compare and predict the output label in a single pass with high accuracy. This is based on the per-calculated samples from each class; hence it can handle noise. To fix the sample size we performed к-means on each class with the к value as 50. Fifty clustering centers are stored for each class to be used as the pattern input to the GRNN model.

< Prev   CONTENTS   Source   Next >