CRNN model for text detection and classification from natural scenes

ABSTRACT


INTRODUCTION
Images of natural scenes often include text that can be used for various purposes such as automatic license plate identification, image retrieval, satellite navigation, guiding robots on their way, street sign recognition, and a better understanding of the images themselves [1], [2]. Although natural scene text recognition has come a long way, it remains a complex process because of factors including complicated backdrops, varying text size, color, orientation, low resolution, occlusion, environmental noise, and blur [3].
Many early deep learning scene text identification approaches [4]-[8] use bounding boxes that closely encompass the text they represent. An image classifier checks each region of interest to see if it contains any instances of text. These techniques can be divided into two-stage and one-stage approaches; two-stage text detectors place anchors in the original image. In recent years, bounding boxes have increasingly given way to image segmentation, giving rise to various scene text detectors [9]-[13]. SSC-Net is inspired by the segmentation approach of [14], which links all picture elements in the same instance.
Although deep learning has made strides in text recognition within natural settings [15], the bulk of the research has primarily concentrated on foreign scripts (such as Greek texts) [16], [17]. Text detection and identification for cursive languages like Kannada, Hindi, and Tamil is still in its early stages, and a noticeable research gap exists for these languages, especially in natural scene text detection and identification. Even as models like convolutional recurrent neural networks (CRNN) show promise in recognizing cursive texts [18]-[20], the challenge is amplified with texts in natural scenes due to their complexity and the variability in backgrounds, fonts, sizes, and colors. Furthermore, much of the existing literature is confined to studying isolated characters and scripts [21]-[23].
In addition, recognition accuracy is reduced when natural scene images contain multiple non-text elements such as leaves, cursive text lines, human agents, and other complex environmental conditions, as illustrated in Figures 1, 2 and 3; the highlights of the proposed model are shown in Figures 4, 5 and 6. Figures 1, 2 and 3 illustrate variations of cursive texts in natural scenes, while the yellow bounding boxes depicted in Figures 4, 5 and 6 show the robustness of the proposed approach to outliers. For these reasons, deep learning approaches have recently been developed for natural scene text recognition [24]. It is worth noting that most of these innovations have been geared toward Latin scripts [25]-[27]. However, text identification and recognition in natural scene photographs is still a developing topic for cursive scripts like Kannada, Tamil, and Telugu.
This paper focuses on a novel network for segmenting English and select Indian scripts like Tamil and Kannada, based on the foundational U-Net [28] architecture; it proposes a receptive field expansion through added convolution layers for richer feature extraction. Historically, text detection and recognition have been pivotal in computer vision. Before the deep learning era, scene text detection often encompassed text extraction followed by candidate filtering, extracting texts based on predefined criteria. Traditional methods relied on thresholding for document binarization, using global thresholds as filters to distinguish between text and images. Techniques like sliding windows and connected components were foundational. Sliding-window methods classified windows of various sizes to detect text, while connected component algorithms, like MSER [29] and SWT [30], grouped pixels. Notably, Tian et al. [31] introduced minimum cost flow networks, addressing error accumulation in texts by spotting related components of candidate characters through a cascade-boosting strategy. Other notable approaches include using a 2D Gaussian kernel [32] for region variability and mathematical morphology to segregate the image background. However, traditional methods often faced limitations, particularly struggling with complex backgrounds and inconsistent brightness, leading to inconsistent results. The proliferation of vast datasets has led to advancements in DL models, including the likes of ResNet50 and VGG19 [33]. Notably, the generative adversarial network (GAN) by Goodfellow et al. [34] has gained prominence for image enhancement and transformation tasks. Research has shown GANs, including models like CycleGAN [35] and pix2pix-HD [36], to be instrumental in semantic segmentation and high-resolution image translations. Moreover, innovative text detection strategies have emerged, with works like [37], [38] leveraging region proposal networks (RPN) for texts with varied orientations. While scholars like Xu et al. [39] and Luc et al. [40] have researched image segmentation-based text detection, the full potential of deep neural networks in this setting remains a promising research area.
This research aims to address the existing gaps by employing a CRNN model that eliminates the need for segmentation, reformulating the text recognition challenge as a sequential, time-based classification task. Our findings indicate that deep convolutional neural networks (CNNs) with skip connections are more effective for feature extraction. Incorporating bidirectional recurrent mechanisms allows the model to capture extended contextual data in both forward and reverse directions. This dual-directional contextual understanding is crucial for enhancing prediction accuracy, particularly for cursive writing styles where character shapes often bear similarities. Further, the proposed method is tested on a variety of Indian language scripts and evaluated alongside state-of-the-art solutions.
Our contributions are fourfold: a) a robust approach to handling complex document images; b) a unique approach to image enhancement that incorporates CLAHE, leading to robust segmentation; c) an efficient neural network for scene text image segmentation and recognition with low computational complexity; and d) a robust multilingual approach that recognizes Kannada, Tamil, and English texts from images, benchmarked in detail on ICDAR2015, ICDAR2017, and our contributed dataset (PDT2023). This study is subdivided into four sections: i) section 1 covers the background, motivation, and survey of relevant literature; ii) section 2 presents the methodology; iii) section 3 presents the experimental results benchmarked against the state-of-the-art; and iv) section 4 highlights the conclusion and scope for future study.

METHOD
This paper introduces a novel semantic segmentation technique [11] that quantifies the distance from the center to the edge of the text, offering a granular perspective on text structure [41] within images. Initially, we provide a comprehensive overview of the dataset selected for this study, detailing its composition and relevance to our research goals. Subsequently, we present an in-depth discussion of the proposed methodology, highlighting how our approach innovatively addresses the challenges of text segmentation in complex visual data. This foundation sets the stage for the detailed analysis and findings presented in the subsequent sections of the paper. The proposed technique addresses challenges related to scene-text enhancement, detection, and classification. Generative adversarial networks (GANs) are utilized due to their proven superiority in generating high-quality samples compared to auto-encoders. The model is designated PDT-Net and is specifically tailored for detecting texts in natural scene images containing select Indian languages. To further improve image quality, contrast limited adaptive histogram equalization (CLAHE) was applied. Notably, methods like binary and Otsu thresholding were deemed unsuitable as they introduced excessive noise, making CLAHE the preferred choice for processing outputs. The results of preprocessing and image enhancement are presented in Table 2.
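For illustration, the following is a minimal sketch of the CLAHE enhancement step using OpenCV; the clip limit and tile grid size are assumed values for demonstration, not tuned parameters from this study.

```python
# A minimal CLAHE sketch using OpenCV; clipLimit and tileGridSize are
# illustrative assumptions, not the study's tuned values.
import cv2

def enhance_with_clahe(image_path, clip_limit=2.0, tile_grid=(8, 8)):
    """Equalize local contrast on the luminance channel of a scene image."""
    bgr = cv2.imread(image_path)
    # Work in LAB space so contrast is equalized without shifting colors.
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```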

Proposed methodology
The architecture of the proposed framework for recognition of English and select Indian texts from natural scenes is illustrated in Figure 7. The model utilizes VGG-16 and ResNet-18 architectures without fully connected layers for the feature extraction stage. Unlike the architecture proposed in [11], which used seven convolutional layers, the feature extraction models are augmented with skip connections to improve gradient flow. On top of the feature extraction component, a bidirectional long short-term memory (BiLSTM) layer with 256 hidden units is employed to decode feature sequences into per-frame label predictions. Finally, a connectionist temporal classification (CTC) layer maps the per-frame label predictions to the final output. The proposed framework includes four intermediate steps between the input image and the final result: i) image preprocessing and enhancement using CLAHE, ii) a CNN model for feature extraction, iii) feature sequence generation using a gated recurrent unit and bidirectional LSTM, and iv) per-frame predictions. The system is trainable with a single loss function and incorporates two networks (CNN and RNN). We draw inspiration from Jaderberg et al. [44], fine-tuning some hyperparameters to address the difficulties of Kannada text recognition. Equally, the framework for Kannada character recognition uses VGG-16 [45], ResNet18 [46], and their variants, alongside a newly proposed comparable network with skip connections. First, per-frame label sequence prediction is accomplished by layering a recurrent network, such as a BiLSTM, on the feature extraction network, followed by a layer that maps the predicted sequences to their final labels. Further, the VGG-16 network was employed without fully connected layers, together with a BiLSTM with 256 hidden units, as in [47], obtaining segmentation accuracy comparable to most state-of-the-art models.
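As a reading aid, a minimal Keras sketch of this pipeline is given below: a CNN backbone without fully connected layers, a feature-map-to-sequence reshape, a 256-unit BiLSTM, and a per-frame softmax feeding a CTC objective. The input size and vocabulary size are assumptions for illustration, not the exact PDT-Net configuration.

```python
# A minimal Keras sketch of the CNN + BiLSTM + CTC pipeline described above.
# Input shape and vocabulary size are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 100  # assumed number of characters; +1 below for the CTC blank

def build_crnn(input_shape=(32, 256, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    # VGG-16 convolutional base only (no fully connected layers).
    backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                           input_tensor=inputs)
    fmap = backbone.output                        # (batch, H'=1, W', 512)
    # Treat each column of the feature map as one time step.
    seq = layers.Reshape((-1, fmap.shape[1] * fmap.shape[3]))(fmap)
    seq = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(seq)
    # Per-frame label scores, later mapped to text by a CTC decoder.
    outputs = layers.Dense(VOCAB_SIZE + 1, activation="softmax")(seq)
    return tf.keras.Model(inputs, outputs)
```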

Feature extraction with pre-trained convolutional neural networks (CNNs)
The core of the framework is the feature extraction component, which employs multiple deep learning architectures. For effective feature extraction, the images are assumed to be sequences of characters. The objective is to identify the most accurate depictions of the patterns in the provided images, preserving vital information at several depths. The feature map was slightly modified to account for the horizontal direction depending on the number of textual instances, and the VGG-16 architecture was adapted to integrate shortcut connections. The idea is to better capture both low-level and high-level features from the text images, which are mostly horizontally oriented in the PDT2023 dataset.
Three feature extraction variants are considered:
a) VGG-16 with baseline model
The primary model for feature extraction in the proposed framework starts with the VGG-16 architecture, which is adapted by adding an extra block containing two convolutional layers and one max-pooling layer to reduce the feature map's height to 1 while maintaining critical aspects of the text sequence representation. Different max-pooling window sizes, specifically 2×2 and 2×1, are utilized in this version of the VGG-16 model.
b) VGG-16 with skip connections
Building upon the standard VGG-16, the improved version (Figure 8) introduces shortcut connections to mitigate the vanishing gradient problem. These connections allow gradients to flow more freely through the network, leading to better feature extraction capabilities. Mathematically, the output feature vector of the enhanced VGG-16 model is represented in (1):

$a_i = F(x_i, \{B_i\}) + x_i$ (1)

where $B_i$ is the weighted parameter of the i-th CNN model, $a_i$ is the output feature, and $x_i$ is the output from an earlier convolutional layer. This model is based on an ensemble of convolution and pooling layers that extract features sequentially.
c) Residual networks (ResNet)
The residual network model employs skip connections that enhance convolutional layer outputs. The original ResNet-18 contains eight residual blocks with two 3×3 kernel layers each, while the modified version introduces a 1, 3, 1 kernel layer sequence. Experiments on ResNet architectures explored different shortcut strategies, with post-activation units in ResNet-18 showing superior effectiveness for the given application.
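A minimal sketch of such a modified residual block is shown below, with the 1, 3, 1 kernel sequence and an identity shortcut following $a_i = F(x_i, \{B_i\}) + x_i$; the filter count is an assumed example.

```python
# A sketch of the modified residual block with a 1x1 -> 3x3 -> 1x1 kernel
# sequence and an identity shortcut; the filter count is an assumption.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Assumes x already has `filters` channels so the identity add is valid.
    shortcut = x
    y = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, 1, padding="same")(y)   # no activation yet
    y = layers.Add()([shortcut, y])                    # a_i = F(x_i) + x_i
    return layers.Activation("relu")(y)                # post-activation unit
```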

Feature map to feature sequence conversion
Recurrent neural networks (RNNs) utilize hidden layers for sequence generation but face challenges like vanishing and exploding gradients when processing extended text sequences. Long short-term memory (LSTM) [48] networks overcome these issues, offering enhanced memory recall from past inputs. Bidirectional LSTMs (BLSTMs) further refine this by having two hidden layers: one processing input sequences from past to future and another from future to past. The final layer of the CNN models transforms outputs into 1D feature maps, which are segmented to produce feature vectors. In mathematical terms, the feature sequences are denoted as $x = \{x_1, x_2, \ldots, x_N\}$, where $x_t \in \mathbb{R}^{512}$ and $N$ is the length of the feature sequence.
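For illustration, a tiny sketch of this map-to-sequence step follows; the feature map width used here is an assumed example.

```python
# Splitting a final (1 x W' x 512) feature map into W' feature vectors x_t;
# the width of 64 here is an assumed example.
import tensorflow as tf

feature_map = tf.random.normal((1, 1, 64, 512))  # (batch, H'=1, W', C=512)
sequence = tf.squeeze(feature_map, axis=1)       # (batch, W', 512)
# sequence[:, t, :] is x_t in R^512, the t-th input to the BiLSTM.
```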

Per frame predictions
To train a BiLSTM, one must locate where each character of the ground truth text lies horizontally in a given image, because the BiLSTM gives a score at each time step for each horizontal position in the image. Owing to the nature of the text and the overlap between characters, it is challenging to separate each character of the ground truth text in an image for cursive scripts like Kannada and Tamil. We therefore employed the method proposed by [49], an approach that has been successful in many character recognition tasks. Bounding box generation is an all-important task; to that effect, Algorithm 1 is proposed:

Algorithm 1: The strategy employed on PDT-Net for text recognition
1: Load input images: resize to 256 × 256
2: Perform data augmentation: random rotations at 30°, 45°, and 60°
3: for i = 1 to N do
4:    Apply image enhancement, i ∈ [1, N]
5:    Apply CLAHE
6:    PDT-Net: feature extraction in the Fn vector space
7:    Sequence generation using RNN and BiLSTM
8:    Frame predictions: Label(Pi) ← max(Label(Pij))
9: end do
10: Label(Pi) ← Null
11: end for
12: return Predicted Text
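The final mapping from per-frame label predictions to text is handled by the CTC layer; a minimal sketch of greedy (best-path) CTC decoding, which collapses repeated labels and drops blanks, is shown below. The blank index of 0 is an assumption.

```python
# Greedy (best-path) CTC decoding: argmax per frame, collapse repeats,
# drop the blank symbol. The blank index of 0 is an assumption.
import numpy as np

def ctc_greedy_decode(frame_probs, blank=0):
    """frame_probs: (T, num_classes) array of per-frame label scores."""
    best_path = np.argmax(frame_probs, axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded
```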

The training process and experimental setup
PDT2023 dataset: To train PDT-Net using a 3×3 filter with a stride of 1, 240 samples were reserved for training and 60 for validation, corresponding to an 80:20 split between the training and testing sets. Normalizing the input features to the range 0 to 1 speeds up network training and convergence. To improve the accuracy of the deep neural networks, the training dataset is expanded using a data augmentation technique that rotates the images at random (by 30, 45, and 60 degrees).
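A sketch of this augmentation step follows; OpenCV is an assumed tool here, not one prescribed by the study.

```python
# Random rotation at 30, 45, or 60 degrees plus [0, 1] normalization,
# mirroring the augmentation described above; OpenCV is an assumed tool.
import random
import cv2
import numpy as np

def augment(image):
    angle = random.choice([30, 45, 60])
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, m, (w, h))
    return rotated.astype(np.float32) / 255.0  # normalize features to [0, 1]
```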
ICDAR2015, ICDAR2017 datasets: These datasets, containing 1,670 and 18,000 images respectively, were used as benchmarks. When creating the training set, only the cropped word images produced by data augmentation were considered. Training used adaptive momentum (Adam) with a learning rate of 10⁻⁵. The network was trained on an NVIDIA 1060 GPU with 24 GB of memory, processing ten input images per batch. The experiments were conducted in Python, with TensorFlow as the backend. The approach requires 0.5 and 0.4 seconds for the training and testing phases, respectively.
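A schematic sketch of this training setup is shown below; the CTC loss wiring with uniform sequence lengths is a simplifying assumption about implementation details not spelled out here.

```python
# Adam at a 1e-5 learning rate with batches of ten images and a CTC
# objective; uniform sequence lengths here are a simplifying assumption.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

def ctc_loss(y_true, y_pred):
    batch = tf.shape(y_true)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])  # frames per sample
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])  # labels per sample
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

# model.compile(optimizer=optimizer, loss=ctc_loss)
# model.fit(train_images, train_labels, batch_size=10, epochs=...)
```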

RESULTS AND DISCUSSION
In this section, we demonstrate the outcomes achieved by applying the proposed techniques. The images selected for analysis intentionally capture a spectrum of lighting conditions and complexities to mirror the diverse challenges encountered in real-world scenarios. This ensures that our results are both theoretically sound and practically applicable, providing a robust validation of the techniques' effectiveness across varied environments.
− Accuracy (Acc) measures the ratio of correct predictions to the total number of predictions. It is defined in (4):

$Acc = \frac{TP + TN}{TP + TN + FP + FN}$ (4)

where TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
− Precision addresses the question of the proportion of positive identifications that was correct. The criterion is expressed as (5):

$Precision = \frac{TP}{TP + FP}$ (5)

− Sensitivity, also known as recall, accounts for the proportion of actual positives identified correctly. It is defined mathematically as (6):

$Recall = \frac{TP}{TP + FN}$ (6)
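For completeness, a small sketch computing the metrics in (4)-(6) from raw confusion counts:

```python
# Accuracy, precision, and recall from confusion-matrix counts, as in (4)-(6).
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # equation (4)
    precision = tp / (tp + fp)                  # equation (5)
    recall = tp / (tp + fn)                     # equation (6), sensitivity
    return accuracy, precision, recall
```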

Results of using PDT-Net across benchmark datasets
The outcomes of employing PDT-Net on established benchmark datasets are presented here. The selected images within these datasets encompass a wide array of illumination conditions and present numerous complexities to closely simulate actual environmental conditions. This careful curation ensures that the effectiveness of PDT-Net is thoroughly evaluated, showcasing its adaptability and robustness in handling the diverse and challenging scenarios encountered in practical applications. In the experiments, as presented in Table 3, different sets of images were tested to evaluate the model's performance, and the model was subsequently evaluated on each of these datasets, as presented in Table 4.
The ICDAR2015 and ICDAR2017 datasets showcased the model's adept text detection and recognition capabilities, as evidenced by annotated outputs.The custom PDT2023 dataset initially highlighted baseline performance, but after refinement, the model demonstrated increased accuracy and linguistic versatility, recognizing English, Kannada, and Tamil texts.This underscores its potential adaptability for diverse real-world linguistic scenarios.
On ICDAR2017, an accuracy of 98% indicates that the model performed better than when trained on the other datasets. However, a precision of 98% on ICDAR2015 suggests the model benefited from that dataset containing fewer images than its successor. An accuracy of 79.2% on the newly proposed dataset (PDT2023) demonstrates the learnability of the model.

The discussion on results obtained using VGG16 and ResNet18
In this sub-section, we present a detailed analysis of the classification performance achieved using pre-trained models such as VGG16 and ResNet18. The comparative results, tabulated in Tables 5 through 7, clearly depict each model's efficacy in the context of our study. This examination not only highlights the strengths and limitations of the employed models but also lays the groundwork for further discussion of the implications of these results for the field of image classification.

Comparative analysis
To determine the efficiency of our proposed method, we conducted a comprehensive comparative analysis, benchmarking our approach against the established methods presented in [7], [39], [49], [4], and [50]. This comparison measures performance in terms of key metrics: accuracy (Acc), precision, and recall, computed using the formulas provided in (4), (5), and (6). By analyzing these metrics across different methodologies, we offer an understanding of how our method stands relative to the state-of-the-art, highlighting its strengths and potential areas for improvement. This comparative approach ensures a robust evaluation, providing readers with a clear perspective on the advancements our research brings to the field. From Table 5, VGG-16 outperforms ResNet18 in precision when the number of RNN units is 256; however, when the number of units is increased to 512, ResNet18 significantly surpasses VGG-16 in recall. Increasing the number of BiLSTM units from 256 to 512 leads to higher precision in VGG-16 but only a marginal increase in recall, which could mean that adding more units improves the model's ability to identify true positives without significantly improving its ability to capture all the positives (recall). Table 6 reports the recognition accuracy using the same number of units: BiLSTM generally performs better than BiGRU when the number of units is held constant at 512, indicative of BiLSTM's effectiveness in capturing long-term dependencies for this specific task.
In terms of accuracy and precision, ResNet50 with BiLSTM and 512 units achieves the highest accuracy and precision among the models, suggesting that for a complex task like text recognition, deeper networks may be more effective. From Table 7, when compared with other relevant models, the overall performance of the proposed PDT-Net stands out with an accuracy of 98.2%, significantly outperforming other state-of-the-art methods. However, its precision and recall are not the highest, suggesting that while the model is highly accurate, there remains room for improvement in correctly identifying true positives (precision) and in identifying all positives (recall).
The method by Pratikakis et al. [50] has a very low recall, indicating that it misses a large number of true positive cases. These tables collectively offer a robust evaluation of PDT-Net's performance in contrast to existing methods and varying configurations, providing both a validation of the approach and insights for future optimization.

CONCLUSION
The research presented offers a comprehensive examination of text recognition methodologies, focusing on varying architectures, configurations, and evaluation metrics. Our proposed PDT-Net model demonstrated superior performance, achieving an accuracy rate of 98.2% and thereby outpacing existing state-of-the-art methods. This result is indicative of the efficacy of the combined CNN-RNN architecture, which leverages the strengths of convolutional layers for feature extraction and recurrent layers for sequence modeling.
However, it is worth noting that while PDT-Net excels in terms of overall accuracy, there are areas where it could be further optimized. Specifically, its precision and recall metrics, while respectable, did not reach the upper echelons observed in some other methods. This suggests that while the model is generally reliable, it may still miss some true positives or incorrectly flag negatives as positives, indicating room for refinement.
Furthermore, the comparative analysis revealed the merits and limitations of different configurations and RNN units. For instance, increasing the number of BiLSTM units from 256 to 512 led to a noticeable rise in precision for the VGG-16 model but only a marginal improvement in recall. This finding is valuable for future studies aiming to balance these metrics effectively. Finally, real-world testing and applications, as well as efforts to make the model more interpretable, could pave the way for its integration into various systems that demand high-efficiency text recognition.



Figure 1. Non-textual objects such as green leaves and lighting superimposed on the word "Snipes"
Figure 2. Non-textual objects such as a motorbike, human driver, and complex environment
Figure 3. A challenging and noisy environment with poorly illuminated texts

Figure 7. General architecture of the proposed detection (PDT) and recognition of English and select Indian texts from natural scenes

Table 2. The results of pre-processing on the PDT2023 dataset

Table 4. Performance metrics on three different datasets

Table 5. Text classification metrics across different pre-trained models

Table 6. Text recognition accuracy across different RNN models with the same number of units

Table 7. Comparison with relevant methods