Performance analysis of optimization algorithms for convolutional neural network-based handwritten digit recognition

Handwritten digit recognition has been widely researched by the recognition community during the last decades. Deep convolutional neural networks (CNN) have been exploited to propose efficient handwritten digit recognition approaches. However, the CNN model may need an optimization algorithm to achieve satisfactory performance. In this work, a performance evaluation of seven optimization methods applied in a straightforward CNN architecture is presented. The inspected algorithms are stochastic gradient descent (SGD), adaptive gradient (AdaGrad), adaptive delta (AdaDelta), adaptive moment estimation (Adam), maximum adaptive moment estimation (AdaMax), Nesterov-accelerated adaptive moment estimation (Nadam), and root mean square propagation (RMSprop). Experiments have been carried out on two standard digit datasets, namely the Modified National Institute of Standards and Technology (MNIST) dataset and Extended MNIST (EMNIST). The results have shown the superior performance of the RMSprop and Adam algorithms over the peer methods on MNIST and EMNIST, respectively.


INTRODUCTION
Although visual recognition tasks come naturally to humans, they remain challenging and sophisticated for machines [1]. Handwritten digit recognition is one of the automated image classification and recognition problems that has received considerable attention from researchers over the past few years [2]. The importance of this technology stems from its numerous applications, for example optical character recognition (OCR), signature verification, text interpretation, manipulation, and many more [3]. Nowadays, deep learning techniques play a significant role in solving handwritten digit recognition and other image recognition problems, such as medical imaging, especially when the dataset size is large [4]-[6]. However, deep neural networks are very complex models that often require the estimation of a large number of parameters. Assigning inappropriate values to these parameters may negatively affect the network's convergence speed and accuracy. Optimization algorithms have been proposed to assign suitable values to the network parameters and improve the overall performance [7]. The following paragraphs review the most widespread optimization algorithms.

The well-known stochastic gradient descent (SGD) [8], [9] is reasonably both effective and efficient, since computing the first partial derivative with respect to each parameter has processing difficulty equivalent to a function evaluation. Nevertheless, the learning rate initialization must be adjusted for SGD. The convergence will slow down or oscillate, and the training algorithm may diverge, as a consequence of an inadequate learning rate [10].
Another popular optimization algorithm is the adaptive gradient method (AdaGrad) [11]. By adapting the learning rate to the parameters, AdaGrad applies larger updates to infrequently updated parameters and smaller updates to frequently updated ones. Although AdaGrad avoids the need for manual learning rate adjustment, the accumulation of squared gradients in the denominator may shrink the learning rate and slow the convergence.
To remedy AdaGrad's flaw, the AdaDelta and Adam methods were proposed [12], [13]. The adaptive delta (AdaDelta) is an SGD variant based on an adaptive learning rate per dimension. It tackles two limitations: the necessity for a manually chosen global learning rate and the continuous decline in learning rates during training. Instead of accumulating all prior gradients as in AdaGrad, AdaDelta adjusts the learning rates according to a moving window of gradient updates [12], [14].
Adam refers to the adaptive moment estimation algorithm, which aims to compute adaptive learning rates for each parameter. It iteratively updates the network weights from the training data, rather than following the classical stochastic gradient descent procedure [13]. In order to converge faster, Adam uses momentum and adaptive learning rates. The maximum adaptive moment estimation (AdaMax) is an extension of Adam obtained by generalizing to the infinity norm. It is computationally efficient and ideally suited to situations with large datasets [15]. The Nesterov-accelerated adaptive moment estimation (Nadam) algorithm is another improved version of Adam, which alters Adam's momentum component using Nesterov's accelerated gradient (NAG) to speed up the convergence process and enhance the learned models [16].
Instead of using a cumulative sum of squared gradients as AdaGrad does, the root mean square propagation (RMSprop) method uses an exponentially decaying average of squared gradients and does not consider history from the extreme past. As a result, the algorithm converges rapidly once it finds a locally convex bowl [17]. Chandra and Sharma [18] also applied the Laplacian score approach in conjunction with an adaptive learning rate to modify the weights in mini-batches during each update.
Despite the abundance of optimization algorithms, existing deep learning-based handwritten digit recognition approaches have not investigated their performance when different optimization algorithms are applied. Hence, the reported results may not reflect the optimal performance of these approaches. The main contribution of this paper is an empirical examination of the performance of seven optimization algorithms applied in a convolutional neural network (CNN)-based handwritten digit recognition approach. To the best of our knowledge, this study represents the first attempt to highlight the effect of these algorithms on CNN's accuracy, loss, error, and processing time in the handwritten digit recognition problem. The remaining part of this article is structured as follows. Section 2 describes the CNN architecture. Section 3 elaborates on the optimization algorithms used in this evaluation. The standard handwritten digit datasets are described in section 4. Section 5 reports the experimental results. Finally, section 6 outlines the conclusion of this work.

CONVOLUTIONAL NEURAL NETWORK (CNN)
This section provides a detailed explanation of the convolutional neural network used in our analytical study. The network consists of an input (convolutional) layer, a pooling layer, a dropout (regularization) layer, a flatten layer, a fully connected layer, and an output layer, as shown in Figure 1 [19], [20]. The convolutional layer receives the input images and is composed of thirty-two 5×5 feature maps with a rectifier transfer function. The convolution process is described using the following notation:
− Each layer is denoted by l, where l = 1 is the first layer and l = L is the last layer.
− The input image is denoted by x, of size H × W, with indices i and j.
− The filter or kernel is denoted by w, of size k_1 × k_2, with indices m and n.
− The weight matrix that connects the neurons of layer l with those of layer l − 1 is denoted by w^l_{m,n}.
− At layer l, the bias unit is denoted by b^l.
− At layer l, the convolved input x^l_{i,j} is defined by x^l_{i,j} = \sum_{m=1}^{k_1} \sum_{n=1}^{k_2} w^l_{m,n} \, o^{l-1}_{i+m,j+n} + b^l.
− At layer l, the output o^l_{i,j} is obtained by applying the activation function f(·) to the convolved input: o^l_{i,j} = f(x^l_{i,j}).
The pooling layer is designed with a pool size of 2×2 and takes the maximum value in each pool. The purpose of the dropout layer is to reduce overfitting by excluding 20% of the layer's neurons. The flatten layer transforms the 2D matrix data into a flat vector that can be processed by the subsequent fully connected layer, which consists of one hundred twenty-two neurons and a rectifier transfer function. The output layer is comprised of ten neurons for the ten classes and employs a softmax transfer function to give a probability score for each class prediction. The CNN model is trained using each of the seven optimization algorithms described in the next section.
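A minimal Keras sketch of this architecture is given below. The layer sizes follow the description above, while the loss function and other compile settings are illustrative assumptions rather than the authors' exact code.

```python
# Hedged sketch of the CNN described in this section (layer sizes follow the text;
# the loss function and compile settings are illustrative assumptions).
from tensorflow.keras import layers, models

def build_cnn(optimizer="rmsprop"):
    model = models.Sequential([
        layers.Conv2D(32, (5, 5), activation="relu", input_shape=(28, 28, 1)),  # 32 feature maps, 5x5 kernels
        layers.MaxPooling2D(pool_size=(2, 2)),    # 2x2 max pooling
        layers.Dropout(0.2),                      # drop 20% of activations to reduce overfitting
        layers.Flatten(),                         # 2D feature maps -> flat vector
        layers.Dense(122, activation="relu"),     # fully connected layer (122 neurons, as stated)
        layers.Dense(10, activation="softmax"),   # one probability per digit class
    ])
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Passing a different optimizer name (or optimizer object) to build_cnn is enough to switch between the seven algorithms compared in this study.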

OPTIMIZATION ALGORITHMS
This section presents a thorough description of the seven optimization algorithms used in this work. The gradient descent (GD) optimization algorithm is the one most frequently applied to address classification issues. The performance of the CNN was enhanced using a variety of GD-based optimizers, including SGD, AdaGrad, AdaDelta, Adam, AdaMax, Nadam, and RMSprop. The following subsections elaborate on each algorithm in more detail.

Stochastic gradient descent (SGD)
The most prevalent algorithm used to train deep neural networks is SGD. This technique has shown good performance in many applications. It moves the parameters of the model in the opposite direction of the minibatch-evaluated gradient to achieve a reduced loss [21]. Its speed and avoidance of redundant computation have made the SGD algorithm the preferred choice in deep learning techniques. In this method, the training data and learning rate are denoted by (x_i, y_i) and η, respectively. The term ∇_{θ_i} E(θ_i) refers to the gradient of the error (loss) function E(θ_i) with respect to the parameter θ_i, and m denotes the momentum factor. Then, (3) is used to update the learning parameters for each training pair until the stopping condition is met, which yields the result θ_{i+1} [22]. However, the SGD method suffers from oscillation in the gradient direction for two reasons: first, the extra noise resulting from the arbitrary sample selection; second, the blind search procedure in the solution space. Moreover, the gradients' variance and the movement direction in SGD are large and biased, respectively [7].
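Since (3) is not reproduced here, the following NumPy sketch illustrates one plausible form of the SGD-with-momentum update; the constants and the exact formulation are assumptions, not the paper's equation.

```python
# Illustrative SGD-with-momentum update (one plausible reading of (3)).
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One update: the velocity accumulates past gradients, then theta moves along it."""
    velocity = momentum * velocity - lr * grad  # classical momentum term m
    theta = theta + velocity                    # move opposite to the gradient
    return theta, velocity
```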

Adaptive gradient (AdaGrad)
Another gradient-based optimization algorithm that individually adjusts the learning rates of the model parameters is AdaGrad. The primary concept of this technique is to adopt a lower learning rate for parameters that correspond to recurrent attributes and a higher learning rate for parameters that correspond to rare traits. This is performed by using all the historical squared gradient values [23]. The AdaGrad optimizer formulates the update equation as defined in (4).
Here g_{t,i} stands for the gradient of the loss function for parameter θ_{t,i} at time step t, and a smoothing term is included to prevent division by zero. The diagonal matrix G_t, with entries G_{t,ii}, holds the sum of squared gradients for parameter θ_{t,i} up to time step t. This method eliminates the necessity for adjusting the learning rate manually. However, the learning rate may shrink, and the convergence may decelerate, due to accumulating the squares of gradients in the denominator [22].
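The update in (4) can be sketched as follows; the default values used here are illustrative assumptions rather than the paper's settings.

```python
# Illustrative AdaGrad update following (4): per-parameter steps scaled by the
# accumulated squared gradients.
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    accum = accum + grad ** 2                            # diagonal of G_t: running sum of squared gradients
    theta = theta - lr * grad / (np.sqrt(accum) + eps)   # eps is the smoothing term
    return theta, accum
```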

Adaptive delta (AdaDelta)
AdaDelta is a stochastic optimization technique that was introduced to alleviate the limitation of AdaGrad by reducing the accumulation in the denominator. It is fully automatic and does not need a default learning rate value. This optimizer limits the number of previously gathered gradients to a fixed window of size ω rather than aggregating all previous squared gradients [24]. Instead of inefficiently storing the gradients, the sum is computed recursively as a decaying average of the preceding squared gradients. Then, the running average E[g^2]_t at time step t depends solely on the current gradient and the previous average [12], as defined by (5). Typically, δ is set at about 0.9. The SGD update equation is rewritten in terms of the parameter update vector, as defined by (6) and (7).
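A sketch of the AdaDelta recursion in (5)-(7) is given below; the decay value of 0.9 follows the text, while the small smoothing constant is an assumption.

```python
# Illustrative AdaDelta update following (5)-(7): two running averages (of squared
# gradients and of squared updates) replace the global learning rate.
import numpy as np

def adadelta_step(theta, grad, eg2, edx2, rho=0.9, eps=1e-6):
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                    # E[g^2]_t, eq. (5)
    update = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad  # parameter update vector
    edx2 = rho * edx2 + (1 - rho) * update ** 2                # running average of squared updates
    return theta + update, eg2, edx2
```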

Adaptive moment estimation (ADAM)
Adam is a method for optimizing stochastic objective functions. It determines individual adaptive learning rates for each parameter using estimates of the first and second moments of the gradients. It is easy to apply, efficient in terms of computation and memory, resistant to diagonal rescaling of the gradients, and suitable for tasks involving a significant amount of parameters and data [13]. As in AdaDelta, Adam retains an exponentially decaying average of both preceding gradients m_t and preceding squared gradients ν_t, as shown in (9) and (10) [23]. Computing the bias-corrected first and second moment estimates eliminates the biases, producing (11) and (12). Then, the parameters are updated as defined by (13). Adam is an effective technique when used to solve problems with extremely noisy and sparse gradients as well as non-stationary objectives [25].
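The sequence (9)-(13) can be sketched as follows; β1, β2, and the step size are the defaults proposed in [13] and are assumptions here rather than values taken from this paper.

```python
# Illustrative Adam update following (9)-(13): moment estimates with bias correction.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """t is the 1-based time step, so the bias-correction terms are well defined."""
    m = beta1 * m + (1 - beta1) * grad        # first moment m_t, eq. (9)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment v_t, eq. (10)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates, eqs. (11)-(12)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update, eq. (13)
    return theta, m, v
```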

Maximum adaptive moment estimation (AdaMax)
AdaMax is a variation of Adam based on the infinity norm. It is a type of adaptive SGD. The main advantage of AdaMax over SGD is that it is far less sensitive to the choice of hyperparameters. Adam's second-moment component is fully utilized in the AdaMax formulation, which provides a more consistent solution [25]. In Adam, the velocity term ν_t scales the gradient in inverse proportion to the L_2 norm of the past gradients and the current gradient |g_t|^2; AdaMax generalizes this scaling to the infinity norm, as defined by (15).
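A sketch of the resulting update is given below; the step size and decay rates are the defaults suggested in [15] and are assumptions here, not values reported in this paper.

```python
# Illustrative AdaMax update: Adam's second moment is replaced by an
# exponentially weighted infinity-norm term u_t, which needs no bias correction.
import numpy as np

def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """t is the 1-based time step."""
    m = beta1 * m + (1 - beta1) * grad        # first moment, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))   # infinity-norm term u_t
    theta = theta - (lr / (1 - beta1 ** t)) * m / (u + eps)  # eps guards against division by zero
    return theta, m, u
```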

Nesterov-accelerated adaptive moment estimation (Nadam)
The momentum and the adaptive learning rate components are the two primary parts of the Adam optimization algorithm. A comparable approach called Nesterov's accelerated gradient (NAG) outperformed conventional momentum [25]. Nadam modifies Adam's momentum component m_t, defined by (17), by updating the gradient g_t in addition to updating the parameters θ_{t−1}, rather than adding the momentum component twice. The resulting Nadam update rule is expressed by (18). This modification benefits from NAG, which enhances the speed of convergence and the quality of the learned models [26].
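A sketch of the Nesterov-style correction is shown below; the exact formulation of (17)-(18) varies across references, so this is one common form rather than necessarily the paper's.

```python
# Illustrative Nadam update: Adam with a Nesterov look-ahead applied to the
# first-moment term (one common formulation; constants are assumed defaults).
import numpy as np

def nadam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """t is the 1-based time step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta2 ** t)
    # Nesterov correction: combine look-ahead momentum with the current gradient
    m_bar = beta1 * m / (1 - beta1 ** (t + 1)) + (1 - beta1) * grad / (1 - beta1 ** t)
    theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v
```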

RMSprop
RMSprop is a popular adaptive stochastic algorithm for deep neural network training. It alters AdaGrad so that the gradient accumulation works in non-convex settings: the gradients are aggregated into an exponentially weighted moving average, so RMSprop preserves recent knowledge of the gradient while discarding the distant past [27]. The RMSprop method divides the learning rate by an exponentially decaying average of squared gradients [13], [28], and it is described by (19) and (20), where suitable values of γ and the learning rate η are 0.9 and 0.001, respectively. RMSprop performs effectively in both offline and online environments, and it demands less tuning in comparison with SGD [29].
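A sketch of (19)-(20) follows; γ and η take the values quoted above, while the smoothing constant is an assumption.

```python
# Illustrative RMSprop update following (19)-(20): an exponentially decaying
# average of squared gradients rescales each step.
import numpy as np

def rmsprop_step(theta, grad, eg2, lr=0.001, gamma=0.9, eps=1e-8):
    eg2 = gamma * eg2 + (1 - gamma) * grad ** 2       # E[g^2]_t, eq. (19)
    theta = theta - lr * grad / (np.sqrt(eg2) + eps)  # parameter update, eq. (20)
    return theta, eg2
```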

DATASETS
In recent years, a variety of handwritten image databases have been released to facilitate handwritten image recognition research. The two most popular datasets, namely the Modified National Institute of Standards and Technology (MNIST) dataset [30] and Extended MNIST (EMNIST) [31], were used in our experiments. These datasets were derived from data collected by the National Institute of Standards and Technology (NIST) and are used to evaluate the performance of handwritten digit recognition approaches. The next two subsections describe each dataset in more detail.

Modified National Institute of Standards and Technology (MNIST)
The MNIST dataset [30] consists of seventy thousand instances, of which 60,000 are used for training and 10,000 are used for testing. This dataset was gathered from two sources: NIST's Special Database 1 and NIST's Special Database 3. The former was gathered from high school students, while the latter was retrieved from Census Bureau staff. The training and testing sets were chosen so that no author contributed to both sets. The training set contains writing examples from more than 250 different authors [32]. Preprocessing was applied to the original images. First, the images were normalized to fit within a 20×20 pixel box while preserving the aspect ratio. The black-and-white images were then converted to grayscale using an anti-aliasing filter. Finally, blank padding was applied to enlarge the images to a 28×28 pixel box and to ensure that the center of mass of each digit matched the center of the image [33]. Figure 2 shows samples of the MNIST dataset.

Extended Modified National Institute of Standards and Technology (EMNIST)
The EMNIST database, introduced in April 2017 by Cohen et al. [31], provides handwritten digits and characters constructed from the NIST Special Database 19, which contains an entire library of training resources for character recognition and handwritten documents [34]. The size of EMNIST is considerably larger than that of MNIST, and the images were collected from a different source. The images were also converted to a 28×28 pixel format similar to MNIST [33]. Figure 3 demonstrates samples of the EMNIST dataset. The EMNIST digit dataset contains 280,000 characters in 10 balanced classes: 235,000 training images, 40,000 testing images, and 5,000 validation images. The data is presented in two forms that contain identical information: the first is a MATLAB format, and the second is a binary format similar to that of the original MNIST dataset [35].

EXPERIMENTS AND RESULTS
This section provides an experimental performance assessment of the previously mentioned optimization algorithms. The experiments were conducted using the benchmark MNIST and EMNIST digit datasets. Several performance metrics, including processing time, error rate, loss, accuracy, validation loss, and validation accuracy, were adopted to demonstrate the effectiveness of the competing algorithms. The experiments were implemented using the Keras Python library in the Jupyter Notebook platform on a Windows 10 MSI laptop with an Intel Core i7-11800H CPU at 2.3 GHz and 16 GB of RAM.
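As a hedged illustration of how such a comparison can be scripted in Keras, the snippet below trains the CNN sketched in section 2 once per optimizer on MNIST and records time and accuracy; the number of epochs and the batch size are assumptions, not the settings used in this paper.

```python
# Hedged sketch of the comparison loop: same CNN, one training run per optimizer.
# build_cnn is the Keras sketch from section 2; epochs/batch size are assumed.
import time
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0   # scale pixels to [0, 1]
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0

results = {}
for opt in ["sgd", "adagrad", "adadelta", "adam", "adamax", "nadam", "rmsprop"]:
    model = build_cnn(optimizer=opt)
    start = time.time()
    model.fit(x_train, y_train, epochs=10, batch_size=128,
              validation_data=(x_test, y_test), verbose=0)
    val_loss, val_acc = model.evaluate(x_test, y_test, verbose=0)
    results[opt] = {"time_s": round(time.time() - start, 1),
                    "val_accuracy": val_acc, "error": 1.0 - val_acc}
```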
Table 1 shows the outcomes of the MNIST experiments. As evident from the results highlighted in bold, the AdaMax method showed the fastest performance. However, the RMSprop algorithm showed outstanding performance, outperforming the peer algorithms in terms of CNN error, loss, accuracy, validation loss, and validation accuracy. The results of the EMNIST digit dataset experiments are reported in Table 2. The scores highlighted in bold indicate that the SGD method achieved the fastest performance, while AdaMax attained the lowest validation loss. However, the Adam algorithm outperformed its counterparts according to CNN error, loss, accuracy, and validation accuracy.
The chart in Figure 4 compares the accuracy rates of each algorithm on both datasets. As can be seen in the figure, every algorithm except RMSprop produced a higher accuracy rate on EMNIST than on MNIST. One possible reason for this behavior is that the training set of EMNIST is much larger than that of MNIST. It should also be mentioned that the Adam, AdaMax, Nadam, and RMSprop methods produced higher accuracy rates than the SGD, AdaGrad, and AdaDelta algorithms on both datasets.

CONCLUSION
This paper has presented a performance evaluation of seven optimization algorithms for handwritten digit recognition using a convolutional neural network. These algorithms include SGD, AdaGrad, AdaDelta, Adam, AdaMax, Nadam, and RMSprop. We used two benchmark datasets, namely MNIST and EMNIST, to conduct the experiments. The performance was assessed using several standard metrics, including processing time, error rate, loss, accuracy, validation loss, and validation accuracy. The experimental results revealed two preferred methods. The overall highest scores in the MNIST experiments showed that the RMSprop method is the best choice for CNN-based handwritten digit recognition, while the Adam method proved to be the best approach according to the highest scores in the EMNIST experiments. Although the two methods showed excellent performance, they still cannot reach 100% accuracy. This can be considered a limitation of the two optimization algorithms, which requires further investigation.

Figure 1. Convolutional neural network

Figure 2. A hundred and sixty samples from the MNIST dataset
Figure 3. Samples from the EMNIST dataset

Figure 4. Accuracy of each algorithm in both datasets

Table 1. Results on MNIST dataset

Table 2. Results on EMNIST digit dataset