Optimizer algorithms and convolutional neural networks for text classification

ABSTRACT


INTRODUCTION
Recently, text classification (TC) has attracted considerable interest in the natural language processing (NLP) field, driven by advances in deep learning research [1]. TC is the task of assigning labels to a given text based on feature selection [2], and it is used in numerous applications such as spam detection, topic labelling, question answering, and sentiment analysis [1]. Sentiment analysis is the task of identifying the polarity of reviews and opinion texts through either binary or multi-class classification [3]. Using deep learning algorithms, many researchers have proposed methods and architectures to improve performance on TC and sentiment analysis problems. Pal et al. [4] and Chamekh et al. [5] adopted recurrent neural network (RNN) models with long short-term memory (LSTM). Meanwhile, Sachin et al. [6] and Zulqarnain et al. [7] employed gated recurrent units (GRU). Kim et al. [8] and Feng et al. [9] implemented convolutional neural network (CNN) models, while Jain et al. [10] and Rehman et al. [11] proposed hybrid models using CNN and RNN layers.
Despite these efforts, these problems still call for more experimental study. In practice, the efficiency of a deep learning model relies not only on the architecture, layers and activation functions used, but also on the selection of an appropriate optimizer [12]. The choice of an optimizer usually rests on best practices, online recommendations or even random selection rather than on empirical evidence, because too few experiments are available [12]. In this paper, we therefore propose a new CNN architecture to classify text reviews as positive or negative, and we apply multiple deep learning optimizers to this CNN to determine the most suitable optimization algorithm for such a classification. We trained our model on three different datasets and examined its performance with each optimizer using the accuracy metric.
The contributions of our paper are as follows: i) a new CNN architecture for binary classification of text reviews; ii) our CNN model reaches good accuracy compared with the state-of-the-art models; and iii) the adaptive optimizers perform better with the new CNN in text review classification than the other optimizers.
The remainder of the paper is arranged as follows: i) section 2 reviews related studies that employed CNN models for the TC problem; ii) section 3 presents our CNN architecture, the datasets, the model settings and the implemented optimizers; iii) section 4 presents the experimental results; and iv) lastly, we conclude the paper and present some perspectives.

RELATED WORK
In this section, we explore some state-of-the-art studies that used deep learning algorithms and CNN models for TC purposes. CNNs have become a common and efficient model architecture for TC problems [13]. Over the years, many researchers have proposed different CNN-based models that aim to extract features from text and predict the intended labels. Kalchbrenner et al. [14] suggested a model called dynamic CNN (DCNN). As Figure 1 shows, the DCNN involves an embedding layer that builds a sentence matrix from the embeddings of the words in a given sentence. Then, the wide convolutional layers and the dynamic pooling layers map over the sentence to produce a feature relation between the words in the sentence. In practice, the dynamic k-max-pooling parameter takes its value based on the sentence length and the position of the convolutional layer.
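The dependence of the pooling parameter on sentence length and layer position can be sketched as follows. The rule k_l = max(k_top, ceil((L - l) / L * s)) is the one stated in the DCNN paper [14], where l is the index of the current convolutional layer, L the total number of convolutional layers, s the sentence length and k_top the fixed pooling size of the topmost layer. The function names and the example values are illustrative assumptions, not code from [14].

```python
import math

def dynamic_k(layer_index, total_layers, sentence_length, k_top):
    """Pooling size for one convolutional layer, following the DCNN rule."""
    # Multiply before dividing so exact cases do not suffer rounding error.
    scaled = math.ceil((total_layers - layer_index) * sentence_length / total_layers)
    return max(k_top, scaled)

def k_max_pooling(values, k):
    """Keep the k largest values of a sequence, preserving their original order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest values, then restored to sequence order.
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]

# Hypothetical setting: 3 convolutional layers, an 18-word sentence, k_top = 3.
k_per_layer = [dynamic_k(l, 3, 18, 3) for l in (1, 2, 3)]
pooled = k_max_pooling([0.2, 0.9, 0.1, 0.7, 0.4], k=3)
```

Lower layers keep more values for long sentences, while the topmost layer always pools down to k_top, which gives the fully connected layers a fixed-size input.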
Figure 1. The DCNN architecture [14]

Int J Artif Intell, ISSN: 2252-8938

Next, Kim [15] presented a lightweight CNN architecture for the TC problem, illustrated in Figure 2, based on one convolutional layer and its filters. Kim's model contains an embedding layer, a convolutional layer and a max-pooling layer, followed by a fully connected layer with dropout and a softmax output. The author used the unsupervised embedding model word2vec and compared four initialization approaches for learning the word embeddings. Kim's approaches have advanced research on TC and sentiment analysis with CNNs [13].
Figure 2. Kim's CNN architecture [15]

There have been many attempts to improve Kim's architecture. Johnson and Zhang [16] trained embeddings of small text regions on unlabeled text data, then fed these embeddings to the CNN model together with the labeled data for TC. The same authors also suggested a deep pyramid CNN (DPCNN) [17], a deeper network designed to improve performance while keeping the computational cost low. Likewise, Liao et al. [18] converted the input sentences into matrices in which each word is represented by a word vector, and these matrices form the embeddings for the CNN architecture. The proposed CNN was able to identify sentiments in tweets. Afterwards, [8], [19] improved the CNN architecture by using consecutive convolutional layers and reached good accuracies for sentiment classification. Besides, [20] and [21] examined different CNN settings to find the optimal CNN configuration and improve performance for TC. On the other hand, several recent studies [10], [11], [22] merged CNN with LSTM for TC purposes. The authors fed the convolutional layers of the CNN with word embeddings, then passed the output to the LSTM layers in order to learn long-term dependencies between words. Finally, the softmax layer takes the output of the LSTM layers and produces the classification result.

RESEARCH METHOD

The proposed network
Our suggested CNN model, described in Figure 3, employs two convolutional layers and a max-pooling layer, plus two fully connected layers. We started by tokenizing our training data using a vocabulary file that contains the most frequent words. Then, we randomly initialized the embedding layer so that it learns meaningful features during training. The embedding layer received the input words and produced feature values for them, so that words are later grouped according to their learned meaning. Afterwards, the two convolutional layers took the output of the embedding layer, slid a window of a given kernel size over it, and applied filters to every window in order to collect more features. We appended a dropout layer to each convolutional layer so as to discard non-optimal features. Next, the max-pooling layer selected the maximum values from the convolutional layers and passed this output to the two fully connected layers. We applied a dropout layer to the first fully connected layer to avoid overfitting, together with a rectified linear unit (ReLU) activation function to limit the growth in computation. The second fully connected layer then produced the result vector, which holds a positive or negative classification value. Finally, a softmax function predicted the label from the probability computed for each class. For the optimization, we applied binary cross-entropy as the loss function and compared a set of optimizers with our suggested CNN to identify the best models. The results of our network with the different optimizer algorithms are presented in the results section.
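The sliding-window convolution and the max-pooling step described above can be illustrated with a minimal pure-Python sketch. It applies a single filter to a toy sequence of word vectors; the kernel size, the weights and the 2-dimensional embeddings are hypothetical values chosen for readability, and the sketch omits dropout, the second convolutional layer and the fully connected layers of the actual model.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors (stride 1, no padding).

    sequence: list of word vectors, each a list of floats of equal length.
    kernel:   list of weight vectors, one per position in the window.
    """
    window = len(kernel)
    feature_map = []
    for start in range(len(sequence) - window + 1):
        total = bias
        for offset in range(window):
            word_vec = sequence[start + offset]
            weights = kernel[offset]
            total += sum(x * w for x, w in zip(word_vec, weights))
        feature_map.append(total)
    return feature_map

def global_max_pool(feature_map):
    """Keep only the strongest response of the filter over the whole sequence."""
    return max(feature_map)

# Toy input: 4 words with 2-dimensional embeddings (the paper uses Ed=150).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter with a window of 2 words (hypothetical kernel size and weights).
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d(sequence, kernel)
pooled = global_max_pool(feature_map)
```

In a full model, each filter yields one pooled value, and the vector of pooled values from all filters is what the fully connected layers receive.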

Experiments
Our experiments were performed with Python and the TensorFlow framework in a Google Colab notebook using the Google Compute Engine backend in central processing unit (CPU) mode with 12.68 GB of memory. We implemented our CNN model to classify reviews as positive or negative on three popular TC datasets: Amazon reviews [23], internet movie database (IMDb) movie reviews [24], [25], and Rotten Tomatoes movie reviews [24], [26]. In addition, we tested a set of optimizers with our CNN model to determine empirically the best optimizer for TC and sentiment analysis problems. The optimizers considered include:
− Momentum [28]: stochastic gradient descent (SGD) follows a noisier and steeper path than gradient descent (GD) because it changes the parameters after each training example, which slows down convergence to the minimum. The momentum algorithm addresses this problem by adding a fraction of the previous update to the current update, which damps the oscillations and speeds up convergence.
− Adagrad [29]: is a gradient-based optimizer that adapts the learning rate to frequent and infrequent parameters. The more often a parameter is updated, the smaller its learning rate becomes. In other words, it makes small updates for frequent parameters and large updates for infrequent ones. Therefore, it is widely used for training on sparse data.
− AdagradDA [29]: refers to Adagrad dual averaging, which is an Adagrad-based algorithm. This optimizer adjusts the regularization of unseen features on each mini-batch. AdagradDA is mainly applied to linear models that require large sparsity.
− Adadelta [30], [31]: is an extension of the Adagrad optimizer that counteracts its decaying learning rate so that the model can keep learning features. In practice, the algorithm restricts the window of accumulated past gradients to a fixed size.
− FTRL [32]: "Follow the (proximally) regularized leader" is a GD-based algorithm with an alternative representation of the L1 regularization and the model coefficients. The optimizer uses per-coordinate learning rates, and it has good sparsity and convergence properties.
− Root mean squared propagation (RMSprop) [33]: is an unpublished optimizer suggested by Geoff Hinton in a Coursera class [33]. The optimizer is based on an adaptive learning rate method. Similar to Adadelta, RMSprop counteracts the monotonically decreasing learning rate of Adagrad and accelerates the optimization. The algorithm divides the learning rate by the root of an exponentially decaying average of squared gradients.
− Adam [34]: or adaptive moment estimation, is an optimizer that employs adaptive learning rates to update each network weight. Adam is an extension of SGD that also inherits features from Adagrad and RMSprop. It requires little parameter tuning and has low memory requirements. Furthermore, it is widely used to solve non-convex problems with large datasets in a shorter running time.
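To make the differences between these update rules concrete, the sketch below applies textbook forms of gradient descent, momentum, Adagrad, RMSprop and Adam to a toy one-dimensional loss f(w) = (w - 3)^2. It is only an illustration under stated assumptions: the loss, the starting point, the learning rate of 0.1 and the decay constants (common defaults) are chosen for the demonstration and are not the settings used in our experiments, which rely on the TensorFlow implementations.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def run_gd(steps, lr=0.1):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)  # plain step against the gradient
    return w

def run_momentum(steps, lr=0.1, beta=0.9):
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        # A fraction of the previous update is added to the current one.
        velocity = beta * velocity + lr * grad(w)
        w -= velocity
    return w

def run_adagrad(steps, lr=0.1, eps=1e-8):
    w, accum = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        accum += g * g  # sum of all past squared gradients
        w -= lr * g / (math.sqrt(accum) + eps)
    return w

def run_rmsprop(steps, lr=0.1, decay=0.9, eps=1e-8):
    w, avg_sq = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        # Exponentially decaying average of squared gradients.
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w

def run_adam(steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

results = {
    "gd": run_gd(200),
    "momentum": run_momentum(200),
    "adagrad": run_adagrad(200),
    "rmsprop": run_rmsprop(200),
    "adam": run_adam(500),
}
```

On this toy problem, Adagrad's step keeps shrinking because its accumulator only grows, whereas RMSprop and Adam normalize by a decaying average and therefore keep moving at a steadier pace. Behaviour on a real CNN loss surface can differ.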
Next, we present the parameter values used in our CNN model:
− For the embedding layer, we employed an embedding dimension Ed=150, a sequence length S=500 and a maximum vocabulary size vocab_size=5,000.
− For the fully connected layers, we set the hidden layer sizes to H1=128 and H2=164. For the optimization, we set the dropout to 0.5 and initialized all the optimizer algorithms with a learning rate of 1e-3.
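These settings can be gathered in a small configuration object, sketched below. The values are those listed above. The class and field names are illustrative, and the weight count assumes an embedding table with exactly vocab_size rows (no extra padding or out-of-vocabulary row), which the text does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CNNConfig:
    embedding_dim: int = 150      # Ed
    sequence_length: int = 500    # S
    vocab_size: int = 5000
    hidden_1: int = 128           # H1
    hidden_2: int = 164           # H2
    dropout: float = 0.5
    learning_rate: float = 1e-3   # same initial value for every optimizer

    def embedding_weights(self) -> int:
        """Number of trainable weights in the embedding table."""
        return self.vocab_size * self.embedding_dim

    def embedded_input_shape(self) -> tuple:
        """Shape of one embedded review: (sequence length, embedding dimension)."""
        return (self.sequence_length, self.embedding_dim)

config = CNNConfig()
n_embedding_weights = config.embedding_weights()
input_shape = config.embedded_input_shape()
```

Under these assumptions, the embedding table alone holds 5,000 x 150 = 750,000 weights, and each review enters the convolutional layers as a 500 x 150 matrix.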

Datasets
In this section, we give some information on the data used. We applied all our CNN models to three text review datasets of different sizes. As shown in Table 1, we classified the datasets as large, medium, and small according to the number of reviews. More details on the datasets are given below:
− Amazon reviews [23] contains 4,000,000 customer reviews up to March 2013 across several product categories. The data is labeled into two classes depending on the review rating, from 1 to 5 stars: 'Positive' corresponds to 4 and 5 stars, and 'Negative' to 1 and 2 stars. For the experiments, we used 100,000 reviews from this data, which represents the large dataset type.
− Rotten Tomatoes (RT) reviews dataset [24], [26] includes 5,331 positive and 5,331 negative text snippets from RT movie reviews. RT was first used in Pang/Lee ACL 2005 [26], and it is a medium dataset in comparison with the previous one.
− IMDb reviews [24], [25] consists of about 2,000 movie reviews annotated for sentiment. The data contains 1,000 negative and 1,000 positive reviews and was introduced in Pang/Lee ACL 2004 [25]. The IMDb movie reviews data is considered a small dataset.
In practice, we split each dataset into three sets: train, validation and test. Then, we pre-processed our training data using several text filters in order to remove noisy content. Afterwards, we built the label and the content from each input sentence to train our models. The results of each model are presented and described in the next section.
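The labeling rule for the Amazon data and the three-way split can be sketched as follows. The star-to-label mapping follows the description above. Returning None for 3-star reviews and the 80/10/10 split proportions are assumptions made for illustration, since the text does not state how neutral ratings are handled or which proportions were used.

```python
import random

def label_from_stars(stars):
    """Map an Amazon star rating to a binary sentiment label.

    4-5 stars -> "positive", 1-2 stars -> "negative", as described in the text.
    Returning None for 3 stars is an assumption of this sketch.
    """
    if stars in (4, 5):
        return "positive"
    if stars in (1, 2):
        return "negative"
    return None

def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and split examples into train, validation and test lists.

    The 80/10/10 proportions are placeholders, not the values used in the paper.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Example with 2,000 placeholder items, the size of the small dataset.
train, val, test = split_dataset(range(2000))
```

Fixing the seed keeps the split reproducible, so every optimizer is compared on the same train, validation and test sets.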

RESULTS AND DISCUSSION
In this section, we describe the results obtained using different optimizer algorithms with our CNN architecture. As shown in Table 2, we applied each optimizer model to three types of datasets (large, medium and small) to explore the impact of the amount of data on the optimizer's efficiency. We evaluated the efficiency of the optimizer models using the accuracy metric.
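For reference, the accuracy metric is the fraction of reviews whose predicted label matches the reference label. A minimal sketch, with hypothetical predictions used only to show the computation:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the reference labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical labels for four reviews; three of the four predictions are right.
predicted = ["positive", "negative", "positive", "negative"]
actual = ["positive", "negative", "negative", "negative"]
score = accuracy(predicted, actual)
```

The accuracies in the tables are this ratio expressed as a percentage.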
For readability, we named each model after the optimizer selected for the CNN architecture. For example, CNN-Gradient-descent denotes the model that uses gradient descent as the optimizer in the CNN model. We used cross-entropy as the loss function in all the models. The results show that the best CNN model for the small and medium datasets is CNN-Adam. However, CNN-RMSprop surpassed CNN-Adam on the large dataset, despite the good accuracy reached by CNN-Adam. We notice that the accuracy of the CNN-Gradient-descent model is poor and changes only in small steps, which means that the model learns slowly even when the amount of data grows. With the CNN-Momentum model, by contrast, the accuracy increased and the model obtained better performance, because the momentum method accelerates GD in the relevant directions and reduces oscillations. The remaining optimizers, apart from RMSprop and Adam, attained inadequate performance, and their accuracy stayed nearly constant across the three types of datasets. Meanwhile, the CNN-RMSprop model achieved good accuracy on the large dataset and, apart from CNN-Adam, it performed better than the other models on the small and medium datasets. RMSprop converges faster and requires less parameter tuning than GD algorithms and their variants. CNN-Adam held its advantage over the other optimizers overall and achieved high accuracy regardless of the amount of data. The Adam optimizer combines advantages of several optimizer algorithms and surpasses the other optimizers in terms of computation time, parameter requirements and ease of implementation.
Figures 4 and 5 show the validation accuracy on the three datasets for our best-performing models, CNN-Adam and CNN-RMSprop. We notice that the larger the dataset, the higher the accuracy. Also, with both optimizers our CNN architecture made good progress and traced a good accuracy curve over the first 100 epochs.

Table 3. Comparison of the results of our best CNN model and some related works models

Model                               Accuracy (%)
Chen and Wang's CNN [22]            76.09
Kim and Jeong's CNN [8]             81.06
Kim's CNN [15]                      81.5
Johnson and Zhang's CNN [16]        85.7
Feng and Cheng's CNN [9]            86.32
DCNN [14]                           86.8
Rehman et al.'s CNN [11]            87
Jain et al.'s CNN [10]              87.1
CNN-RMSprop (our model)             90.48

CONCLUSION AND PERSPECTIVES
In this paper, we suggested a new CNN architecture to classify text reviews as negative or positive. We tested a set of optimizer algorithms with the suggested model to determine empirically the best optimizer for the sentiment analysis problem. The experiments showed that RMSprop and Adam yield the most efficient models. Moreover, we obtained strong performance compared with the state-of-the-art architectures mentioned, reaching an accuracy of 90.48%. As a perspective, we plan to apply our model to a multi-class text classification problem and to further improve the performance of the architecture.

Figure 3. The proposed CNN model for text classification

Table 1. Number of reviews and types of data in the three datasets

Table 2. The results of the optimizer models for each dataset-type