Combining convolutional neural networks and spatial-channel “squeeze and excitation” block for multiple-label image classification

ABSTRACT


INTRODUCTION
Nowadays, catheters and tubes are used to keep critically ill patients alive in emergency departments and intensive care units [1]. Tubes have many advantages. For example, an endotracheal tube (ETT) is used to deliver oxygen to the lungs in the case of respiratory issues or lung illnesses. The nasogastric tube (NGT) is useful for individuals who are unable to eat or who consume less than 60% of their daily calorie requirements [2].
Misplacement can occur with three common tubes. First, in the case of the ETT, the correct position is between 5 and 7 cm above the carina; the carina is used as the landmark for estimating the ETT position. A misplaced ETT commonly results in accidental intubation of a bronchus, more often the right main bronchus than the left. Second, a chest X-ray is used to determine the position of the NGT. The tip of a correctly placed NGT should be visible at least 10 cm beyond the gastroesophageal junction. A misplaced NGT might pass into the airways, and tube misplacement raises many issues, such as a looped tube in the right main bronchus. Finally, a chest X-ray determines the tip position of the central venous catheter (CVC). The correct CVC position places the tip within the superior vena cava, at or just above the level of the carina. CVC misplacement can lead to complications such as pneumothorax or pneumomediastinum [3]. The reported proportions of mispositioned cases are 5%-28% for ETTs, about 15% for NGTs, and 2%-7% for CVCs. If catheters are placed incorrectly, severe complications might arise. Commonly, chest X-rays are used to estimate the positioning of various medical equipment and to assess any associated issues following placement or misplacement [1]. However, radiologists may receive many patient radiographs to interpret each day, and there may be delays. Assessment of tube position is complicated.
Computer-aided techniques have the potential to help radiologists prioritize radiographs with possibly mispositioned catheters for interpretation and to automatically include information about catheter placement in radiology reports, thereby increasing radiologists' performance [1].
In the literature, the classification of chest X-ray images of catheters and tubes has long been researched. Classification methods for tubes have been proposed using various approaches, including rule-based methods [4] and decision trees [5]. Other classification methods are based on deep learning (DL), such as AlexNet [6], GoogLeNet [6], U-Net [7], ResNet [8], and EfficientNet [9]. However, these methods remain suboptimal for multi-label classification tasks, because an image may contain multiple tubes from the current intubation depending on the patient.
In this work, we conduct a study that focuses on the multi-label classification task. We have 11 targets in chest X-ray images, which can be divided into 4 groups: ETT, NGT, CVC, and Swan-Ganz catheter. This paper is organized as follows. Section 2 presents the research method, section 3 describes the experimental results, and the conclusion is discussed in section 4.

METHOD
In this section, we explain the research model for tube image classification, beginning with the implementation and moving on to the attention mechanism and the head layers. Section 2.1 describes the materials used in our work, section 2.2 presents the details of the attention mechanism, and section 2.3 explains the well-known modules.

Implementation
All the experiments were trained and tested on an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz (4 cores, 8 logical processors) with 12.0 GB of RAM and an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. We utilized the RANZCR-CLiP dataset from the National Institutes of Health Clinical Center (NIH), which is available on Kaggle and comprises adult chest X-ray images. The dataset comprises 30,083 manually labeled images: 24,062 for the training set, 6,021 for the validation set, and 3,255 for the test set [21]. All images in this experiment are stored at 3×448×448 pixels, as shown in Figure 1. The number of images of each tube type is shown in Figure 2. To evaluate our experiments on the multiple-tube image classification task, we chose three of the most popular convolutional neural network (CNN) architectures as backbones: Inception_v3 [22], Resnet200D [23], and EfficientNetB5 [24]. Figure 3 shows the placement of the attention module in the multiple-label image classification model.

Attention mechanism
In this section, we briefly introduce the concept of the attention mechanism through the squeeze-and-excitation block (SE block). The SE block was proposed by Hu et al. [10] as an approach for enhancing channel relationships. The authors designed the block to increase a network's representational ability by explicitly modeling the interdependencies between the channels of its convolutional features. The block offers a feature recalibration technique that allows the network to learn to use global information to selectively emphasize informative features and suppress less helpful ones. A computational structure of the SE block is presented in Figure 1. Consider a transformation F_tr: X → U, mapping an input feature X ∈ R^(H′×W′×C′) to an output feature map U ∈ R^(H×W×C), where H and W represent the spatial height and width, and C′ and C are the numbers of input and output channels, respectively. First, the input feature X is passed through the transformation. Let V = [v_1, v_2, …, v_C] be the set of 2D spatial kernels and * denote the convolution operator. F_tr provides the output feature map U = [u_1, u_2, …, u_C], where each channel is calculated using (1): u_c = v_c * X.
After that, the transformation output U is squeezed to aggregate global spatial information using global average pooling; this step provides the necessary contextual information. Here, the c-th element of the vector z is the channel-wise statistic, where (i, j) denotes a spatial location. It is calculated using (2): z_c = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j).

Second, the squeeze block is followed by an excitation block that aims to fully capture channel-wise dependencies in order to make use of the information accumulated during the squeeze step. To achieve this, the block must learn a non-mutually-exclusive relationship between channels, so it uses a sigmoid activation, which allows several channels to be emphasized at once rather than a one-hot activation. To reduce complexity, two fully connected (FC) layers are used: W_1 reduces the dimension and is followed by a ReLU, and the FC layer W_2 restores the dimension, as in (3): s = σ(W_2 δ(W_1 z)), where r denotes the reduction ratio (here r = 16), W_1 ∈ R^((C/r)×C), W_2 ∈ R^(C×(C/r)), δ represents the ReLU function, and σ is the sigmoid function. The final output X̃ of the SE block is obtained by rescaling U with the activations s, calculated using (4): x̃_c = s_c · u_c.
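The squeeze, excitation, and recalibration steps above can be sketched in PyTorch as follows. This is a minimal illustration of the SE block of Hu et al., not the authors' implementation; the reduction ratio r = 16 follows the text.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pooling over space,
    FC -> ReLU -> FC -> sigmoid over channels, then channel rescaling."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W_1: reduce dimension
        self.fc2 = nn.Linear(channels // reduction, channels)  # W_2: restore dimension
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                               # squeeze, eq. (2): (B, C)
        s = self.sigmoid(self.fc2(self.relu(self.fc1(z))))   # excitation, eq. (3)
        return u * s.view(b, c, 1, 1)                        # recalibration, eq. (4)
```

Because the sigmoid outputs lie in (0, 1), each channel of U is attenuated independently, which is what allows several channels to be emphasized at once.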

Head layers
This work focuses on techniques for enhancing the performance of CNNs in the context of multiple-label image classification. In the following subsections, we briefly explain well-known and widely used modules, i.e., SA [25], pre-activation spatial attention (PAc-SA) [26], cSE [10], sSE [13], batch normalization (BN)+sSE, scSE [13], [14], and BN+scSE blocks. In our experiment, all these approaches are used as benchmarks against our method.

Spatial attention block
A spatial-wise attention block ignores channel information and preserves the features of different channels equally. Conv_{3×3,1} refers to a 3×3 convolution whose hidden feature map has one channel, Conv_{3×3,m} refers to a 3×3 convolution whose hidden feature map has m channels, where m is the required integer number of output channels, and BN denotes the batch normalization operation [25]. In our work, we used m values of [64, 32, 8], respectively. The module adaptively generates a spatial weight map to properly represent spatial connections. Spatial attention can be computed as in (5).
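A sketch of this spatial attention block in PyTorch is given below. The hidden widths [64, 32, 8] follow the text, but the exact layer ordering in [25] is not fully specified here, so this is one plausible reading rather than the authors' code.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Spatial-wise attention: stacked Conv3x3,m -> BN -> ReLU stages with
    hidden widths m = 64, 32, 8, then Conv3x3,1 and a sigmoid to produce a
    one-channel spatial weight map that reweights every location equally
    across channels."""

    def __init__(self, in_channels: int, hidden=(64, 32, 8)):
        super().__init__()
        layers, c = [], in_channels
        for m in hidden:
            layers += [nn.Conv2d(c, m, 3, padding=1),
                       nn.BatchNorm2d(m),
                       nn.ReLU(inplace=True)]
            c = m
        layers += [nn.Conv2d(c, 1, 3, padding=1)]  # Conv3x3,1: one-channel map
        self.body = nn.Sequential(*layers)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.sigmoid(self.body(x))  # (B, 1, H, W) spatial weights
        return x * q                    # broadcast over channels, eq. (5)
```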

Pre-activation-spatial attention block
To reduce the number of network parameters and increase the performance of multiple-label image classification, we introduce a block that combines spatial attention with a pre-activation residual design, namely the pre-activation spatial attention (PAc-SA) block. The PAc-SA block has three BN operations, two ReLU activations, one 3×3 convolution with an eight-channel hidden feature map, one 3×3 convolution with a one-channel hidden feature map, and one sigmoid activation. It places BN and ReLU as pre-activation units before each convolution operation. The recalibrated values are multiplied by the different weights of the feature map to obtain the output in (6).
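One way to realize the layer counts listed above (three BNs, two ReLUs, two 3×3 convolutions, one sigmoid) in pre-activation order is sketched below. The exact ordering is our reading of the description, not verified against the authors' code.

```python
import torch
import torch.nn as nn


class PAcSA(nn.Module):
    """Pre-activation spatial attention: BN and ReLU precede each 3x3
    convolution (8 hidden channels, then 1), with a final BN and sigmoid
    producing a one-channel spatial weight map."""

    def __init__(self, in_channels: int, hidden: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),   # pre-activation 1
            nn.Conv2d(in_channels, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),        # pre-activation 2
            nn.Conv2d(hidden, 1, 3, padding=1),
            nn.BatchNorm2d(1), nn.Sigmoid(),                      # third BN + gate
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.body(x)  # recalibrate each spatial location, eq. (6)
```

With only an 8-channel hidden convolution, this block adds far fewer parameters than the SA block's 64-32-8 stack, which matches the stated motivation.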

Spatial squeeze and channel excitation block
Roy et al. [14] considered the input feature map U; the cSE block squeezes the spatial information of U by passing it through (1) followed by global average pooling in (2). The channel information of the squeezed vector z is then excited as expressed in (3). The final step computes the recalibrated output X̃ using (7).

Channel squeeze and spatial excitation block
Roy et al. [13] introduced sSE, which compresses the information from all channels with a 1×1 convolution: the feature maps of U are squeezed along the channel dimension and the spatial information is excited. Let U = [u^{1,1}, u^{1,2}, …, u^{i,j}, …, u^{H,W}] represent the slices along the spatial dimensions, where (i, j) is a spatial location. In this experiment, we adopted the original sSE as shown in (8) and additionally applied BN after the sigmoid function, as shown in (9).

Spatial and channel squeeze and excitation block
Roy et al. [14] presented scSE, which runs the cSE block in (7) and the sSE block in (8) in parallel and adds both outputs, as in (10). scSE generates both channel re-scaling and spatial re-scaling: the feature maps are recalibrated to learn more meaningful representations in terms of both spatial and channel relevance, see (10). To improve the performance of the CNN model and the attention block, we propose applying BN to normalize the input feature maps and multiplying the outputs of the cSE and sSE blocks elementwise, which captures both the spatial and channel importance of each location on the feature maps, see (11).
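Both variants can be sketched in a single self-contained module. `mode="add"` corresponds to the original scSE of eq. (10); `mode="bn_mul"` is our reading of the proposed BN+scSE of eq. (11) (normalize the input, then multiply the two branch outputs elementwise). This is an illustrative sketch, not the authors' code.

```python
import torch
import torch.nn as nn


class SCSEBlock(nn.Module):
    """Spatial and channel squeeze-and-excitation with two combination modes:
    'add' sums the cSE and sSE outputs (eq. (10)); 'bn_mul' batch-normalizes
    the input and multiplies the branch outputs elementwise (eq. (11))."""

    def __init__(self, channels: int, reduction: int = 16, mode: str = "add"):
        super().__init__()
        self.mode = mode
        self.bn = nn.BatchNorm2d(channels)
        # cSE branch: squeeze spatial, excite channels
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        # sSE branch: squeeze channels, excite spatial locations
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.mode == "bn_mul":
            x = self.bn(x)                      # normalize input feature maps
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                  # channel-wise statistics
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))
        cse = x * s.view(b, c, 1, 1)            # channel re-scaling
        sse = x * torch.sigmoid(self.spatial(x))  # spatial re-scaling
        return cse * sse if self.mode == "bn_mul" else cse + sse
```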

RESULTS AND DISCUSSION
This section discusses the outcomes of each experiment. Inception_v3 [22], Resnet200D [23], and EfficientNetB5 [24] architecture models were tested on the training and validation data. We provide a concise and precise description of the experimental results, their interpretation, and the conclusions that can be drawn in comparison with the other modules: the initial baseline SA [25], PAc-SA [26], cSE [10], sSE [13], BN+sSE, scSE [13], [14], and BN+scSE blocks. The goal of this experiment is to determine the mean accuracy score (mAs) and mean average precision score (mAPs) of each architectural model during the image training process. Each experiment was run for 16 training epochs. Tables 1 to 3 present the number of network parameters, mAs, and mAPs of each CNN architecture with different attention layers.
To compute the accuracy of the model, the function returns 1 if the entire set of predicted labels of a sample matches the true set of labels, and 0 otherwise (subset accuracy), where y′_i denotes the predicted value of the i-th sample and y_i represents the corresponding true value; the score is the fraction of correct predictions over all samples. In this experiment, we trained the CNN models with 5-fold validation. The mean accuracy mAs is evaluated as in (13).
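The subset-accuracy rule described above can be sketched as a short NumPy function (a minimal illustration of the metric, not the authors' evaluation code):

```python
import numpy as np


def subset_accuracy(y_true, y_pred) -> float:
    """Exact-match (subset) accuracy for multi-label targets: a sample
    counts as correct only if its entire predicted label set matches the
    true set; the score is the fraction of such samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.all(y_true == y_pred, axis=1)))


# Two samples, three labels: only the first sample matches on every label.
acc = subset_accuracy([[1, 0, 1], [0, 1, 0]],
                      [[1, 0, 1], [0, 1, 1]])  # -> 0.5
```

Note that this metric is strict: a single wrong label among the 11 targets makes the whole sample count as incorrect.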
In the mAPs computation, a precision-recall curve is summarized by the weighted mean of the precision obtained at each threshold, with the increase in recall from the previous threshold used as the weight: AP = Σ_n (R_n − R_{n−1}) P_n, where R_n and P_n denote the recall and precision at the n-th threshold. The mean average precision mAPs is calculated as in (15).
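The recall-weighted summation over thresholds can be sketched as follows (an illustrative NumPy implementation of the standard AP definition, not the authors' evaluation code):

```python
import numpy as np


def average_precision(y_true, scores) -> float:
    """AP as the recall-weighted mean of precision: sum over thresholds of
    (R_n - R_{n-1}) * P_n, treating each ranked sample as a threshold."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))     # rank by descending score
    hits = y_true[order]
    tp = np.cumsum(hits)                         # true positives so far
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / max(int(y_true.sum()), 1)
    prev_r, ap = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += (r - prev_r) * p  # weight precision by the recall increment
        prev_r = r
    return float(ap)
```

For example, with true labels [1, 0, 1] ranked by scores [0.9, 0.8, 0.7], the summation gives 1·0.5 + 0.5·(2/3) ≈ 0.833.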
Table 1 presents the results of Inception_v3 with different attention blocks. The cSE block obtained the best results among all the attention blocks, with an mAs of 94.68% and an mAPs of 65.34%. It was followed by three other high performers: scSE with an mAs of 94.70% and an mAPs of 65.24%, SA with an mAs of 94.73% and an mAPs of 64.96%, and BN+sSE with an mAs of 94.68% and an mAPs of 66.01%. Overall, Resnet200D with the addition of the BN+scSE block outperforms all trained CNN architectural models, with an mAs of 95.87% and an mAPs of 72.15% (see details in Table 2). The EfficientNetB5 architecture follows, with an mAs of 95.24% and an mAPs of 68.44% (see details in Table 3), then Inception_v3 with an mAs of 94.68% and an mAPs of 65.34% (see details in Table 1).

CONCLUSION
This paper has evaluated experiments on each CNN model with different attention blocks. Among the CNN models, Resnet200D with the improved scSE is the best architecture for multiple-label image classification on a chest X-ray dataset with multiple catheters and tubes from NIH, followed by EfficientNetB5 with the improved scSE and Inception_v3 with cSE. The experiments show that CNN models with attention blocks outperform the baseline CNN models in terms of mAs and mAPs. In addition, we have improved the attention modules, i.e., the improved sSE and the improved scSE, which use BN to normalize the input feature maps of the attention blocks; this can reduce covariate shift and lessen the effects of exploding and vanishing gradients.

Table 1. Number of network parameters, mean of accuracy score, and mean of average precision score of Inception_v3 with different attention blocks

Table 2 presents the results of Resnet200D with different attention blocks. The BN+scSE block obtained the best results among all the attention blocks, with an mAs of 95.87% and an mAPs of 72.15%, followed by scSE with an mAs of 95.85% and an mAPs of 71.31%, BN+sSE with an mAs of 95.79% and an mAPs of 70.46%, and SA with an mAs of 95.79% and an mAPs of 70.42%. Table 3 presents the results of EfficientNetB5 with different attention blocks. The BN+scSE block obtained the best results among all the attention blocks, with an mAs of 95.24% and an mAPs of 68.44%, followed by scSE with an mAs of 95.24% and an mAPs of 68.21%, BN+sSE with an mAs of 95.21% and an mAPs of 68.07%, and PAc-SA with an mAs of 95.16% and an mAPs of 68.02%.

Table 2. Number of network parameters, mean of accuracy score, and mean of average precision score of Resnet200D with different attention blocks

Table 3. Number of network parameters, mean of accuracy score, and mean of average precision score of EfficientNetB5 with different attention blocks