Deep ensemble learning for skin lesions classification with convolutional neural network

Received Mar 12, 2020; revised May 22, 2021; accepted Jun 6, 2021.

Melanoma is a type of skin cancer that is considered a malignant tumor. This dangerous disease causes many deaths worldwide, so the early detection of skin lesions is an important task in the diagnosis of skin cancer. Recently, a machine learning paradigm known as deep learning (DL) has been applied to skin lesion classification. However, previous studies that classified seven diagnostic classes of skin lesion images with a single DL approach based on a convolutional neural network (CNN) architecture did not achieve satisfactory performance. DL approaches such as deep convolutional neural networks (DCNNs) allow the development of medical image analysis systems with improved performance. In this study, we propose an ensemble learning approach that combines three DCNN architectures, Inception V3, Inception ResNet V2, and DenseNet 201, to improve performance in terms of accuracy, sensitivity, specificity, precision, and F1-score. We use 10,015 dermoscopy images in seven skin lesion categories from the well-known HAM10000 dataset. The proposed model achieves 97.23% accuracy, 90.12% sensitivity, 97.73% specificity, 82.01% precision, and an 85.01% F1-score. These results are promising for classifying skin lesions in cancer diagnosis.


INTRODUCTION
A skin lesion is an area of skin tissue that has an abnormal growth or appearance compared to the surrounding skin [1]. Some types of lesions are potentially cancerous. One type of skin cancer that is considered a malignant tumor is melanoma [2]. Melanoma causes many deaths worldwide [3], so it is very important to diagnose it at an early stage to improve patient survival [4]. Melanoma can be detected by medical diagnosis using digital imaging, known as dermoscopy [5]. Dermoscopy is a noninvasive imaging technique that obtains an enlarged image of a skin lesion by the use of polarized light [6]. This technique visualizes features of pigmented skin lesions that cannot be seen and assessed directly [7]. Although this method increases the accuracy of diagnosis, the process is very complex and error-prone [5]. Therefore, an automated computerized system for analyzing digital images is needed to support decision making.
The classification of dermoscopic images of skin lesions has long been studied in the literature. Various methods have been proposed, including conventional methods such as traditional feature extraction techniques based on texture and color features [8], border-based texture analysis and wavelet decomposition [9], and rule-based image processing or segmentation algorithms [10]. Other classification methods use machine learning (ML), such as the support vector machine (SVM) [11], decision tree [12], and neural network [13]. However, these methods are still not optimal because classifying dermoscopy images requires feature extraction and additional processing of the dataset, where the model parameters are not directly reduced [14]. More recently, an ML method has been developed that overcomes the feature extraction problem through feature learning based on a different structure. In this approach, learning is carried out over a deep structure to recognize the image, hence the name deep learning (DL).
In recent years, many studies have used the DL approach for medical applications [15]- [17]. The advantage of this method is that features are learned automatically, which allows it to handle large datasets. The convolutional neural network (CNN) is one of the DL methods considered to have the best architecture for several image classification applications. In medical applications in particular, CNNs with many different architectures perform well in the classification, segmentation, and detection of medical images [16]. The conceptual framework of CNNs includes weight sharing, local receptive fields, and spatial down-sampling. As a result, CNNs are relatively invariant to displacement, distortion, and scaling [18]. For the classification of skin lesions to detect skin cancer specifically, some researchers have applied single CNNs with good results [19]- [21]. However, previous CNN-based research classified skin lesions into a limited number of classes, only two or three. There are many categories of dermoscopy images to diagnose, and when a single CNN architecture is used to classify a larger number of skin lesion classes, the classifier performance decreases. Hence, a method that classifies several classes of skin lesions with good performance in terms of accuracy, sensitivity, specificity, precision, and F1-score is desirable.
In this paper, we propose a deep convolutional neural network approach to classify several diagnostic categories of skin lesions. To improve the classification performance, we build an ensemble of three CNN architectures: Inception V3, Inception ResNet V2, and DenseNet 201. This paper is organized as follows. In section 2, we briefly describe the dataset, the data pre-processing, and the CNN classifiers, including the ensemble model that combines Inception V3, Inception ResNet V2, and DenseNet 201. In section 3, we present the results and analysis of the experiment. Finally, in section 4 we draw some conclusions.

MATERIAL AND METHODS
This paper proposes a new approach to classifying skin lesions into seven different classes. We use an ensemble model that combines three CNN architectures: Inception ResNet V2, Inception V3, and DenseNet 201. The method consists of data preprocessing, the ensemble of CNNs, and an evaluation of the classifier based on performance metrics.

Preprocessing data
The input to the network is dermoscopy images from the HAM10000 dataset, which is publicly available through the International Skin Imaging Collaboration (ISIC) 2018 archive [22]. The dataset was obtained from patients of various ages and genders. Figure 1 shows one sample image from each diagnostic category. The dataset contains 10,015 dermoscopy images covering all important diagnostic groups in the field of pigmented skin lesions: actinic keratoses and intraepithelial carcinoma/Bowen's disease (akiec, 327 images), basal cell carcinoma (bcc, 514 images), benign keratosis-like lesions (bkl, 1099 images), dermatofibroma (df, 115 images), melanoma (mel, 1113 images), melanocytic nevi (nv, 6705 images), and vascular lesions (vasc, 142 images), as summarized in Table 1.
The original images are in JPEG format with 450×600 pixels, which is too large, so each image is resized to 192×256 pixels. A stratified method was applied to split the dataset into 8111 images for the training set, 902 images for the validation set, and 1002 images for the testing set. Each image is then normalized by dividing its pixel values by 255. To increase the amount of training data without removing the essence of the data, a real-time data augmentation module was also added to our platform. The purpose of data augmentation is to increase the number of skin images and to reduce overfitting of the network. The augmentation consists of rotation by up to 60°, together with shear, zoom, width shift, and height shift, each set to 0.2.
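The stratified split and the division by 255 described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the paper does not say which tool performed the split, so the function `stratified_split`, the fractions `(0.81, 0.09, 0.10)`, and the seed are hypothetical choices that approximate, but do not exactly reproduce, the 8111/902/1002 partition.

```python
import random
from collections import defaultdict

# Class sizes of HAM10000 as listed in the text (10,015 images in total).
CLASS_COUNTS = {"akiec": 327, "bcc": 514, "bkl": 1099, "df": 115,
                "mel": 1113, "nv": 6705, "vasc": 142}


def stratified_split(labels, fractions=(0.81, 0.09, 0.10), seed=0):
    """Split sample indices into train/val/test sets class by class.

    The fractions are an assumed approximation of the split sizes in the
    text; they are not taken from the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * fractions[2])
        n_val = round(len(indices) * fractions[1])
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])  # remainder goes to training
    return train, val, test


def normalize(pixels):
    """Scale 8-bit pixel values to the range [0, 1] by dividing by 255."""
    return [p / 255.0 for p in pixels]


# One label per image, in class order, standing in for the real metadata.
labels = [name for name, count in CLASS_COUNTS.items() for _ in range(count)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting class by class keeps the rare categories (for example df and vasc) represented in all three sets, which a purely random split of such an imbalanced dataset does not guarantee.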

Convolutional neural network (CNN)
A CNN is an artificial neural network designed to process data with a grid-like structure, such as images. A simple CNN architecture usually has four types of layers: convolutional layer, rectified linear unit (ReLU) layer, pooling layer, and fully connected layer [23]. CNNs have a hierarchical architecture. Starting from the input signal $x$, each subsequent layer $x_j$ is given by

$$x_j = \rho\, W_j\, x_{j-1}, \tag{1}$$

where $W_j$ is a linear operator in the convolution layer and $\rho$ is a rectifier $\max(x, 0)$ or a sigmoid $1 / (1 + \exp(-x))$. The operator $W_j$ acts as a stack of convolutions of the previous layer and is defined by

$$x_j(u, k_j) = \rho\Big(\sum_{k} \big(x_{j-1}(\cdot, k) * W_{j,k_j}(\cdot, k)\big)(u)\Big). \tag{2}$$

Here $*$ is the discrete convolution operator:

$$(f * g)(x) = \sum_{u=-\infty}^{\infty} f(u)\, g(x - u). \tag{3}$$

The optimization problem described by a CNN is highly non-convex. The weights are learned by stochastic gradient descent, with gradients calculated by the backpropagation algorithm. The convolution layer is used to learn features and identify classes from image datasets. The convolution operation with the ReLU activation function in CNNs is expressed as

$$x_j^{l} = \max\Big(0,\ \sum_{i} x_i^{l-1} \otimes k_{ij} + b_j\Big), \tag{4}$$

where $x_i^{l-1}$ is the $i$-th feature map of the previous layer, $x_j^{l}$ is the $j$-th output feature map, $k_{ij}$ is the convolution kernel, $b_j$ is the bias, and $\otimes$ indicates the convolution operation. The convolution operation slides the kernel over the input image and, at each position, computes the output as the dot product between the kernel and the image patch beneath it. The CNN architecture for skin lesion classification is shown in Figure 2.

CNNs are a DL architecture that is widely used to classify diseases from medical images [17]. In previous studies, CNNs have been proposed for the classification of skin lesions from dermoscopy images and produced superior performance [19], [24], [25]. This architecture is used to extract features from dermoscopy images. In this paper, we evaluate Inception V3 [26], Inception ResNet V2 [27], and DenseNet 201 [28] separately, as well as an ensemble of the three architectures, for the classification of 7 classes of skin lesions. We removed the last fully connected layer of each of the three architectures and replaced it with a new classification head consisting of one global max-pooling layer, one fully connected layer with 512 neurons, one dropout layer with a probability of 0.5, and an output layer with a SoftMax activation function for classifying 7 types of skin lesions.

Inception V3 is an architecture based on the inception module. This architecture consists of 9 Inception modules with 22 convolutional layers. The Inception module has convolution layers with three different kernel sizes (5×5, 3×3, and 1×1) and pooling layers with 3×3 filters [26]. An improvement was introduced in [27] with the release of the Inception ResNet V2 architecture, which is a variation of the Inception V3 model. Along with Inception V3 and Inception ResNet V2, we use DenseNet 201. A dense convolutional network (DenseNet) is an architecture that connects each layer to every subsequent layer with the same feature-map size. DenseNets have many benefits, including alleviating the vanishing-gradient problem, strengthening feature propagation, encouraging feature reuse, and greatly reducing the number of parameters [28]. The three CNN architectures are combined into an ensemble learning architecture as described in Figure 3.
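As a small worked example of the convolution followed by ReLU described in this section, the sketch below slides one kernel over a tiny single-channel image. It is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `conv2d_relu`, the 3×3 image, the 2×2 kernel, and the bias are all made up for the example. It uses no padding and a stride of 1 and, like most CNN frameworks, does not flip the kernel.

```python
def relu(value):
    """Rectified linear unit: max(0, value)."""
    return max(0.0, value)


def conv2d_relu(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding, stride 1), add `bias`,
    and apply ReLU to every output position.

    Each output value is the dot product between the kernel and the image
    patch beneath it. The kernel is not flipped.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = bias
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(relu(total))
        output.append(row)
    return output


# Hypothetical 3x3 input and 2x2 kernel, chosen only for illustration.
image = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d_relu(image, kernel, bias=-1.0)
print(feature_map)  # [[1.0, 4.0], [0.0, 1.0]]
```

The bottom-left position sums to 0 before the bias, so the pre-activation value is -1 and ReLU clips it to 0; the other three positions pass through unchanged.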

Training and testing
Our model is implemented in Python using an artificial intelligence framework. We use the ReLU activation function in the fully connected layer, the SoftMax activation function in the output layer, categorical cross-entropy as the loss function, and the Adam optimization algorithm with the learning rate initialized at 0.0001. The network parameters are initialized with weights pre-trained on ImageNet. Features are first extracted by running the images through the pre-trained network as a feature extractor, and fine-tuning is then applied to extract more specific features. At this stage, several experiments were carried out. The first experiment fine-tuned the Inception V3 model by training all layers. The second experiment fine-tuned the Inception ResNet V2 model by training the top layer. The third experiment fine-tuned the DenseNet 201 model by training all layers. After training the individual CNN models, we built an ensemble by combining all three models to classify 7 types of skin lesions. The ensemble takes the average of the output probabilities of the three models. The ensemble model was run on an NVIDIA GeForce RTX 2080 GPU and an Intel(R) Core(TM) version 9 processor with a 3.60 GHz clock rate. Processing each image took 9s and 147 ms at test time.
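The averaging step of the ensemble can be sketched as follows. This is a minimal illustration of averaging output probabilities, not the code used in the experiments: the helper names `average_ensemble` and `predict_label` are hypothetical, and the three probability vectors are made-up SoftMax outputs for a single image.

```python
CLASS_NAMES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]


def average_ensemble(model_probabilities):
    """Average the per-class probabilities of several models for one image."""
    n_models = len(model_probabilities)
    n_classes = len(model_probabilities[0])
    return [sum(p[c] for p in model_probabilities) / n_models
            for c in range(n_classes)]


def predict_label(probabilities):
    """Return the class name with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[best]


# Made-up SoftMax outputs of the three fine-tuned models for one image.
inception_v3 = [0.05, 0.05, 0.10, 0.05, 0.40, 0.30, 0.05]
inception_resnet_v2 = [0.05, 0.05, 0.10, 0.05, 0.25, 0.45, 0.05]
densenet_201 = [0.05, 0.05, 0.05, 0.05, 0.50, 0.25, 0.05]

ensemble = average_ensemble([inception_v3, inception_resnet_v2, densenet_201])
print(predict_label(inception_resnet_v2))  # nv
print(predict_label(ensemble))             # mel
```

In this example one model on its own prefers nv, but the averaged probabilities favor mel, which shows how averaging lets the ensemble overrule a single model's error.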

EXPERIMENTAL RESULTS AND ANALYSIS
To test the proposed model, 1002 dermoscopy images are used. Before building the ensemble model, nine CNN architectures were evaluated individually to assess their capability (Table 2). Table 2 shows that DenseNet 201 outperformed the other architectures, but its sensitivity was still below 90% and overfitting between training and testing occurred. To overcome this drawback, an ensemble architecture was built from three architectures: Inception V3, Inception ResNet V2, and DenseNet 201. These three best-performing CNN architectures are combined, and the average accuracy, sensitivity, specificity, precision, and F1-score are calculated. All values were obtained from the confusion matrix. The classification results are given in Table 3, and the confusion matrix is shown in Figure 4.
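The sketch below shows one way the five metrics can be derived from a multi-class confusion matrix. It rests on an assumption that the text does not confirm: each class is scored one-vs-rest and the per-class values are then averaged over the classes. The function names and the 3-class matrix are hypothetical and are not the results of this study.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(confusion, k):
    """One-vs-rest metrics for class k.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def macro_average(confusion):
    """Average each one-vs-rest metric over all classes."""
    per_class = [per_class_metrics(confusion, k) for k in range(len(confusion))]
    return {name: sum(m[name] for m in per_class) / len(per_class)
            for name in per_class[0]}


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [[8, 1, 1],
                 [2, 6, 2],
                 [0, 1, 9]]
averaged = macro_average(toy_confusion)
print({name: round(value, 4) for name, value in averaged.items()})
```

Under this assumption the averaged accuracy counts true negatives for every class, so it can be noticeably higher than the fraction of images assigned to the correct class (23 of 30 in the toy matrix), and it can sit well above the averaged sensitivity.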
The single fine-tuned DenseNet 201 model gives very good results even though it does not have as many parameters as Inception V3 and Inception ResNet V2. The performance of DenseNet 201 in this experiment shows that this model can be used to train on different datasets. Using the ensemble learning approach, we combined the previously fine-tuned Inception V3, Inception ResNet V2, and DenseNet 201 models, and this ensemble gave the best classification results, with an average accuracy of 97.23%. The ensemble of these three architectures improves the accuracy and the prediction results by taking the average of the models' output probabilities. Furthermore, we compared the proposed method with several previous studies on the multi-class classification of skin lesions. Table 4 summarizes the results of the proposed method and compares them with previous approaches to multi-class classification of skin lesions using dermoscopy images. As shown in Table 4, the proposed ensemble model produces better accuracy than the other studies. The performance was checked across the training, validation, and testing processes, and we attribute the improvement to the deeper combined architecture of the ensemble model.
This study has several limitations: the number of dermoscopy images per class in the dataset is imbalanced, and the single CNN models still overfit. We did not find a training strategy that reduced overfitting and helped the single CNN models achieve better results. Through experimentation, we also found that the transfer learning approach does not provide good performance for this dataset because of the main differences in features between dermoscopic images and ImageNet. Future work should investigate more deeply how to avoid overfitting and should examine other ensemble techniques and CNN classification architectures for the task of classifying skin lesions.

CONCLUSION
Diagnosing skin lesions is very challenging because of the high inter-class similarities and intra-class differences between lesions in terms of color, size, location, and appearance. In this study, we propose an ensemble learning approach that can classify seven diagnostic categories of skin lesions. We built the ensemble model by combining three deep CNN architectures: Inception V3, Inception ResNet V2, and DenseNet 201. Our model classifies 7 classes of skin lesions with an average accuracy, sensitivity, specificity, precision, and F1-score of 97.23%, 90.12%, 97.73%, 82.01%, and 85.01%, respectively. The proposed ensemble model achieves higher classification accuracy than the previous studies considered in our comparison. The experimental results indicate that the proposed framework is promising.