A compact deep learning model for Khmer handwritten text recognition

ABSTRACT


INTRODUCTION
Khmer is the official language of Cambodia, spoken by about 16 million people. It has an alphasyllabary (abugida) writing structure: words are composed of syllables, most of which consist of a radical for a consonant and additional marks for vowels. The modern Khmer alphabet consists of 33 consonants.
There is a great demand for a recognition system that reflects the specifics of Khmer writing, due to the constant accumulation of documents in spheres such as government, healthcare, finance, and education. Until the early 2000s, most records in the government and private sectors in Cambodia were held as handwritten documents and hand-filled forms. One has to browse manually through the entire mass of paper to reach any of these records. The bulk of such tasks is extremely difficult to carry out on a daily basis, even with the help of a systematic archiving system. An effective deep learning [1] application for digitizing handwritten text is particularly important for promoting the development of public and private services. Such an application also needs to be inexpensive and applicable in developing economies.
In contrast to other common writing systems, there has been very little research on Khmer text recognition. Most of the work has been done only within the past decade [2]-[8]. Sok and Taing [3] and Srun and Vishnyakov [4], [5], [7] studied recognition of printed Khmer text. Ye et al. [8] developed an online recognition method for printed text in the Khmer, Bangla, and Myanmar alphabets. The amount of work in the field, as well as the nature of the data collected for the relevant experiments, describes the current state of the art for Khmer handwritten text recognition (HTR). Most of the data used in past experiments were printed (machine-derived) text, which greatly impedes the development of an accurate application.
To increase overall performance, an independent network was trained and evaluated for each class. While training each network, one particular class was taken as "positive" and all others as "negative". That is, given a set of classes $C = \{c_1, c_2, \ldots, c_n\}$, the samples of class $c_i$ were isolated, and all samples of the other classes $c_1, c_2, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n$ were considered as "not $c_i$" (or $c_i'$). Training a convolutional neural network (CNN) with this setting yielded a classifier model $f_i(\cdot)$. The output of the training process was the combination of all trained classifiers (1). Intuitively, the final model was designed to iterate the question "Are you of class $c_i$?" instead of asking directly "What class are you?"
This work aims to design a compact model for the Khmer HTR system. The lack of appropriate datasets contributes to the difficulty of this task. Only datasets collected in preliminary experiments [9], [10] were used.

RELATED WORK

Recognition of Khmer handwriting
Meng and Morariu [2] described how to combine a feedforward artificial neural network (ANN) with a self-organizing map (SOM) to design a recognition system for printed Khmer characters. Sok and Taing [3] described their experiment with a support vector machine (SVM) on printed Khmer characters. Accuracy by font size and CPU load were presented as efficiency assessments. The authors also listed the scarce work done on Khmer optical character recognition (OCR) to emphasize the lack of research for the Khmer language. Backpropagation was used by Srun [4] to train a classifier to recognize Khmer characters. For the experiments, Srun sampled printed text. Preprocessing consisted of resizing images to standard dimensions. Thumwarin et al. [6] used a finite impulse response (FIR) system to extract features from handwritten Khmer characters and passed the results to a Euclidean distance-based classifier. The work relies on temporal information, which is impossible to collect from a scanned image of a manuscript. Another problem is that the method requires extra hardware for collecting temporal information. Another work by Srun and Vishnyakov [7] included the implementation of classifiers in Tesseract and further improvement of the recognition quality for scanned characters. The earliest mention of Khmer HTR in a computerized setting dates back to 2008, in the work by Ye et al. [8], which proposed a recognition system for scripts such as Myanmar, Khmer, and Bangla. Research data was collected by drawing characters with a mouse, which is also a drawback of the work. Unlike many previous attempts, the data used in the current work reflects the nature of common handwriting, which makes the resulting models more realistic. Khmer datasets acquired in previous attempts are compared in Table 1.
Table 1. Khmer datasets acquired in previous attempts
Authors | Data collection | Data size
Sok and Taing [3] | Printed and scanned text | Khmer characters, 3000
Ye et al. [8] | Collected by mouse, stylus pen | Khmer, 135; Myanmar characters, 107
Thumwarin et al. [6]
ISSN: 2252-8938, Int J Artif Intell, Vol. 10, No. 3, September 2021: 584-591
and pooling. CNNs also differ from each other in the method and objective of training, e.g., prediction, object discovery, segmentation.
According to LeCun [1], [16], a CNN is a variation of the multilayer perceptron that requires minimal preprocessing. The connectivity pattern between neurons in a CNN is inspired by the biological processes of the animal visual cortex, where each cortical neuron responds to signals from only a restricted area of the visual field (the receptive field). Matsugu et al. [17] described that the receptive fields connected to different neurons partially overlap. As a result, the entire visual field is covered, which leads to smooth vision. Figure 1 shows an example of a three-dimensional neuron arrangement in a convolutional neural network. In this example, the input layer takes a three-channel image, where each pixel has a separate value for the red, green, and blue components, and every layer produces its output in the form of a 3D matrix of neurons. The data used in this study was preprocessed into grayscale images.

Figure 1. 3-D neuron arrangements in a CNN [18]
The convolution operation is performed on the input data. This step models the response of an individual biological neuron to visual input. The activation step applies a transformation to the output of each neuron by using an activation function. The rectified linear unit (ReLU) is an example of a commonly used activation function. It passes a positive neuron output through unchanged and maps a negative output to zero, i.e., it computes max(0, x).
The output of the activation step can be further transformed by applying a pooling step. Pooling reduces the dimensionality of the feature map by condensing the output of small regions of neurons into a single output. This helps to simplify the subsequent layers and reduces the number of parameters that the model needs to learn. CNN layers are configured by these three concepts. A CNN can have tens or hundreds of hidden layers, each of which learns to detect different features of an image. In such feature maps, every hidden layer increases the complexity of the learned image features. For example, the first hidden layer learns how to detect edges, and the last layer learns how to detect more complex shapes.
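The three steps described above (convolution, activation, pooling) can be sketched in a few lines of plain Python. This is only an illustrative toy, not the implementation used in the study: the 5×5 image, the 2×2 edge filter, and the function names are assumptions made for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses unchanged and replace negative ones with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Condense each size x size region into its maximum value."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy grayscale image (assumed): a bright vertical bar on a dark background.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Toy filter (assumed): responds to a dark-to-bright transition from left to right.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, kernel)   # each row is [0, 2, 0, -2]
activated = relu(feature_map)               # each row is [0, 2, 0, 0]
pooled = max_pool(activated)                # [[2, 0], [2, 0]]
```

The negative response at the right edge of the bar is removed by ReLU, and pooling shrinks the 4×4 feature map to 2×2 while keeping the strongest edge response.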
In a CNN, inputs from a small local receptive field (LRF) are connected to one neuron of the hidden layer. The LRF is translated across the image to create a feature map from the input layer for use in the hidden layers. Convolutions are used to implement this process efficiently [19]. A convolution operation is applied to the input of each layer. The convolution mimics the reaction of neurons to visual input. The CNN architecture also includes pooling layers, which group the outputs of a cluster of neurons in one layer into a single neuron in the next layer [11], [20]. The clusters are square patches of size $n \times n$, where $n = 2, 3, 4, \ldots$.
In some cases, pooling patches need to be moved beyond the boundaries of a sample image, which may cause ambiguity in the training process as well as computational and programmatic complexity. Extending the image by several rows and columns of pixels to match the size of the pooling patches (padding) helps to overcome this problem. The values of the extra pixels may be chosen in different ways: the average over the spectrum of pixel values (average padding) or zeros (zero padding). Denoting the filter size as $F$, the input size as $W$, the resulting image size as $R$, the padding size as $P$, and the stride size as $S$, the size of the sample after each convolutional or pooling layer becomes
$$R = \frac{W - F + 2P}{S} + 1 \qquad (2)$$
which can also be deduced for two dimensions by applying (2) to the height and the width separately.
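As a worked illustration of the size relation, the standard form $R = (W - F + 2P)/S + 1$ is assumed here, with $W$ the input size, $F$ the filter size, $P$ the padding, and $S$ the stride. The numbers below are chosen only for the example and are not taken from the tables of this work.

```latex
% Convolution with padding that preserves the size: W = 28, F = 5, P = 2, S = 1
R = \frac{28 - 5 + 2 \cdot 2}{1} + 1 = 28

% Pooling without padding: W = 28, F = 2, P = 0, S = 2
R = \frac{28 - 2 + 2 \cdot 0}{2} + 1 = 14
```

The first case shows how padding keeps the border pixels usable without shrinking the image; the second shows how a 2×2 pooling patch with stride 2 halves each dimension.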

RESEARCH METHOD
The current experiments were based on the same data set and most of the preprocessing steps of our previous work [9], [10]. Later, the potential to substantially increase the recognition rate of neural networks was explored [21]. Figure 2 shows the development of the Khmer HTR framework. Data collection and preliminary experiments were completed in our previous work [9], [10]. In the preliminary experiments, the number of features was reduced by 90% using three independent methods: correlation-based feature selection (CORR), two-dimensional Fourier transform (FT2D), and Gabor filters (GF). The result of each method was classified with an artificial neural network (ANN). The original data, without feature space transformation, was also classified for comparison of performance. Gabor filters yielded the highest improvement in recognition. This suggested that filters may play an important role in feature extraction. The current study is therefore based on convolutional models, which rely on a wider variety of filters. In the course of the current work, the models LeNet-5, AlexNet, VGG16, VGG19, and ResNet50 were modified for binary classification.

One-against-all tactic
Khmer samples of one consonant were taken as the positive class and those of the remaining consonants as the negative class. Such practice is called two-way classification, since there are only two classes to distinguish: "positive" and "negative". It has been adopted at all stages of the work. The performances of all 33 classifiers (one per consonant) were averaged to obtain the performance of each method. The final classification model for each method is the assembly of the classifiers as described in (1). This tactic was adopted since it has proven to be highly effective in comparison to direct multi-class classification in many other applications [22]. Each Khmer character was treated, based on its root radical (consonant), as a sample of that consonant. Since 17 vowels were combined with each consonant, there were 17 samples in each class.
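The relabeling behind the one-against-all tactic can be sketched as follows. This is a minimal illustration under stated assumptions: the class names are hypothetical placeholders, the classifiers are stand-in scoring functions rather than trained CNNs, and picking the most confident classifier is an assumed combination rule, since the exact rule is not spelled out in the text.

```python
def one_vs_all_labels(labels, positive_class):
    """Relabel samples: 1 for the chosen consonant, 0 for every other one."""
    return [1 if label == positive_class else 0 for label in labels]


def build_binary_tasks(labels):
    """Create one binary labeling per class, keyed by class name."""
    return {c: one_vs_all_labels(labels, c) for c in sorted(set(labels))}


def assemble(classifiers):
    """Combine per-class scoring functions into one multi-class predictor.

    Each classifier answers "are you of class c?" with a score; the assembly
    returns the class whose classifier gives the highest score (assumed rule).
    """
    def predict(sample):
        return max(classifiers, key=lambda c: classifiers[c](sample))
    return predict


# Hypothetical class names standing in for Khmer consonants.
labels = ["ka", "kha", "ka", "ko"]
tasks = build_binary_tasks(labels)   # tasks["ka"] is [1, 0, 1, 0]

# Stand-in scoring functions; in practice each would be a trained binary model.
toy_classifiers = {
    "ka": lambda sample: 0.9 if sample == "a" else 0.1,
    "kha": lambda sample: 0.8 if sample == "b" else 0.2,
}
predict = assemble(toy_classifiers)
```

With 33 consonants, `build_binary_tasks` would produce 33 binary labelings of the same sample set, one per classifier to be trained.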

The proposed model
This study introduces a convolutional neural network with a compact architecture: two convolutional layers and one fully connected layer. The model is referred to as "2+1CNN" for brevity. The model is built from the ground up and is initialized with random weights. 2+1CNN is based on the one-against-all tactic and is designed for binary classification.

Proposed model architecture
2+1CNN was proposed as a compact model for Khmer HTR and is expected to ease the burden of computational requirements while staying close to the guidelines of previously designed successful architectures [1], [11]-[13], [23]. Local receptive fields of size 5×5 were used in the convolutional layers. Max pooling over 2×2 patches was applied after each convolutional layer. Convolutional and pooling layers were kept as simple as possible to reduce the number of computations per filter. The input size was kept the same as in the previous research [11]. To prevent overfitting, 50% of the nodes in the fully connected layer are randomly dropped out. The rectified linear unit (ReLU) is used as the activation function because it is simple to differentiate and behaves similarly to other activation functions. The hyper-parameters used in 2+1CNN are as follows:
− Input images are pre-processed and resized to 224×224.
− First convolutional layer with ReLU as activation function, 5×5 filters, stride size 1.
− First pooling layer with 2×2 filters, stride size 1.
− Second convolutional layer with ReLU activation, 5×5 filters, stride size 2.
− Second pooling layer with 2×2 filters, stride size 1.
− The dropout stage randomly erases 50% of the perceptrons to reduce overfitting.
− The fully connected layer is made of 463 perceptrons with the ReLU activation function. The number was chosen as the average (number of features + number of samples) / 2.
Table 2 illustrates the structure of 2+1CNN. The values R, W, P, and F were obtained per (2). Filter sizes were chosen to minimize the number of computations required during model training. Figure 3 visualizes a sample as it passes through each layer of 2+1CNN. The represented layers are input, convolution, pooling, convolution, pooling. All other models used in this work (LeNet-5, AlexNet, VGG16, VGG19, ResNet50) were also modified so that the number of output classes was reduced to two. This modification was made to implement binary classification, as required by the adopted one-against-all tactic. While 2+1CNN is built from the ground up, transfer learning was used to retrain the state-of-the-art models on Khmer samples. Due to the limitations of available processing power and the large amount of data, training of all classifiers was limited to 500 iterations.
Figure 3. Visualization of a sample within 2+1CNN
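To make the listed hyper-parameters concrete, the sketch below traces the spatial size of a 224×224 sample through the two convolution and pooling stages using the relation R = (W − F + 2P)/S + 1. The padding is assumed to be zero in every layer, which the text does not state, so the resulting sizes are illustrative and may differ from those in Table 2; the function and variable names are also assumptions.

```python
def output_size(w, f, p=0, s=1):
    """Size after a convolution or pooling layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


# (layer name, filter size F, stride S), following the hyper-parameter list.
LAYERS = [
    ("conv1", 5, 1),
    ("pool1", 2, 1),
    ("conv2", 5, 2),
    ("pool2", 2, 1),
]


def trace_sizes(input_size=224, padding=0):
    """Return the side length of the square feature map after each layer.

    `padding=0` is an assumption made for this sketch.
    """
    sizes = [("input", input_size)]
    size = input_size
    for name, f, s in LAYERS:
        size = output_size(size, f, padding, s)
        sizes.append((name, size))
    return sizes


for name, size in trace_sizes():
    print(f"{name}: {size} x {size}")
```

Under the zero-padding assumption the sizes are 224, 220, 219, 108, and 107. Only the second convolution, with stride 2, reduces the feature map substantially; the pooling layers with stride 1 shrink it by a single pixel each.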

Performance evaluation
The performance of each classifier was quantified by the recognition rate on the testing data set: the ratio of the number of samples recognized correctly to the total number of samples. To ensure the robustness of each model, four-fold cross-validation was applied. To measure the performance of an assembly of classifiers, the average of their recognition rates was taken, as per (1).
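The evaluation described above can be sketched in plain Python. This is a minimal illustration, not the evaluation code of the study: the way samples are assigned to folds (interleaved by index) and all function names are assumptions.

```python
def recognition_rate(y_true, y_pred):
    """Ratio of correctly recognized samples to the total number of samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def k_fold_splits(n_samples, k=4):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Samples are assigned to folds by index modulo k (an assumed scheme).
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx


def mean(values):
    """Average, used both over folds and over the per-class classifiers."""
    return sum(values) / len(values)


# Toy check: three of four binary predictions are correct.
rate = recognition_rate([1, 0, 1, 1], [1, 0, 0, 1])   # 0.75
```

In a full run, each of the 33 binary classifiers would be scored on every fold, the fold scores averaged per classifier, and the per-classifier averages averaged again to give the recognition rate of the method.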

System specifications
The system used in experimentation: Windows 7 (64-bit), 4 GB RAM, Intel Core i3-220 2.20 GHz CPU. The CNN architectures were implemented in Keras with a TensorFlow backend.

RESULTS AND DISCUSSION
The deep learning (DL) models were applied to each of the existing data sets individually and were compared by recognition rate on each data set. The main finding of the study is that the compact model 2+1CNN is highly effective. Its recognition rate was more than 94% on average, on par with the other models. This serves as a proof of concept for an ad-hoc CNN-based recognition system that can be designed in a setting with low computational capabilities. The implications are important for applications in growing economies, such as Cambodia's, where developers and data engineers have limited access to high-performance technology. Table 3 compares the hardware used in previous experiments to that of the current work. The overall comparison of models is given in Table 4. Table 5 shows the comparison of the current work against previous attempts. It highlights the theoretical progress in the field of handwritten text recognition for abugida writing systems, including Khmer. In previous attempts, data was collected either by scanning printed text or by drawing with a computer mouse, which makes it difficult to represent common handwriting. The results of the current HTR task were achieved on a hardware system of lesser specifications.
Table 5. Comparison of the current work against previous attempts
Authors | Data collection | Data size | Method | Result
Sok and Taing [3] | Printed and scanned text | Khmer characters, 3000 | SVM | 98%
Ye et al. [8] | Mouse drawn, stylus | Khmer, 135; Myanmar, 107 | Stock methods | Writing speed
Thumwarin et al. [6] | Scanned text | Khmer symbols, 6750 | Distance-based | 98%
Kruy and Kameyama [14] | Printed and scanned text | Khmer words, 1104 | SIFT, distance-based | 98%
Meng and Morariu [2] | Printed and scanned text | Khmer characters, 215 | ANN | 65%
Kheang et al. [15] | Printed and scanned text | Khmer words, 110713 | WFST | ~73%
Srun [4] | Printed and scanned text | Khmer characters, 33 | ANN | 97%
Annanurov and Noor [9], [

CONCLUSION
This work aimed to develop a compact and effective model for offline recognition of Khmer handwritten characters. In general, recognition rates were in the range of 93-98%. The 2+1CNN model was built from the ground up and achieved a recognition rate of over 94%, which is at the same level as other, more sophisticated models. The results also help to close the research gap in the field since, at the time of the experiments, Khmer HTR had not yet been approached with deep learning. The main contribution is the compact Khmer HTR model (2+1CNN) with low computational requirements, which is based on open-source software and does not require any proprietary packages. These aspects ease its implementation and therefore allow swift digitization of document corpora in rural and developing areas. The developed models may be applied in a high-end OCR application targeted at the general public, as well as used as the back-end of more sophisticated applications that aim to digitize documents. Further work may include recognition based on information about the layout of documents, forms, and tables.