Advanced mask region-based convolutional neural network based deep-learning model for lung cancer detection

ABSTRACT


INTRODUCTION
Lung cancer is a leading cause of death among people with cancer worldwide. Medical research is challenged by the fact that cancer is one of the most feared diseases in the world. The number of cancer cases worldwide has been increasing, and lifestyle factors contribute greatly to this trend. Pulmonary nodules, which are small cell masses that develop inside the lungs, are one of the signs of lung cancer. It can be challenging to discriminate between pulmonary nodules that are malignant (cancerous) and benign (non-cancerous). Because they have the potential to spread to other regions of the body and grow, malignant lung nodules must be found and treated as soon as feasible. To discriminate between malignant and benign lung nodules, doctors must consider numerous aspects, including the patient's medical history and symptoms and the shape, size, location, and texture of the nodules. Doctors use a variety of diagnostic techniques to assess pulmonary nodules: computed tomography (CT) scans, which produce detailed images of the lungs and show the morphology of the nodules; positron emission tomography (PET) scans, which quantify the metabolic activity of the nodules and show how they are developing; and needle biopsy, which entails taking a tissue sample from the nodule and examining it under a microscope. With these techniques, it is possible to determine the most effective course of action by estimating the likelihood of malignancy for each identified lung nodule.
The growth of nodules, which are abnormal cell growths in the lungs, is one of the ways that lung tumors can begin. CT scans of the lungs reveal these nodules. A person can have one or many nodules in one or both lungs, depending on the cause and type of the nodules. Common causes of lung nodules include respiratory diseases such as pneumonia, tuberculosis, and cancer. The characteristics of lung nodules can help to ascertain the likelihood of them being malignant. These characteristics include the number of nodules (single or multiple), the shape and location of the nodules, and the changes in the nodules over time. To distinguish between benign and malignant nodules, several researchers have been striving to create ways to detect and classify lung nodules based on these characteristics.
For identifying lung cancer nodules in CT scans, Venkatesh et al. [1] offer a method that combines Otsu thresholding and the cuckoo search algorithm. The identified nodules are then classified as either benign or malignant using a CNN. They report that their technique outperforms existing approaches based on particle swarm optimization and genetic algorithms, achieving an accuracy of 96.97%. Feng et al. [2] use a deep learning strategy to distinguish peripheral and nonperipheral lung cancer images using the ResNet50 model and a feature pyramid network algorithm. To reduce the dimensionality of the features and boost the model's performance, they apply Fisher coefficients. They report an accuracy of 97.2% and a Dice coefficient of 0.986, which are comparable to state-of-the-art techniques. Using Mask R-CNN, a popular framework for instance segmentation, Cai et al. [3] propose a methodology for 3D lung image identification and segmentation. To aid investigation and diagnosis, they also employ a ray-casting volume rendering approach to visualize the segmented lung cancer nodules in 3D space. They assess their methodology using two datasets: the Ali TianChi challenge datasets, which come from a lung cancer prediction competition, and the LUNA16 dataset, an openly accessible dataset for lung nodule research. On these datasets, they attain sensitivities of 88.1% and 88.7%, respectively. Additionally, they offer specialized 3D lung models for pulmonary nodule detection that can be used to analyze the nodules' spatial distribution and shape. Another publicly accessible dataset for pulmonary image interpretation is the LIDC-IDRI dataset, which Liu et al.
[4] utilize to segment lung nodules. As a supporting task for nodule segmentation, they use block-based classification learned on the COCO dataset. By giving the model more data and guidance, they argue that this methodology can increase the segmentation's robustness and accuracy. On the LUNA16 dataset, Kopelowitz et al. [5] use Mask R-CNN to conduct lung nodule detection and segmentation. They assess nodule size by comparing the predicted volumes to the actual volumes, a crucial step in the diagnosis and planning of lung cancer treatment. They demonstrate that the technique can measure nodule size reliably and offer helpful data for clinical decision-making. Mask R-CNN is used by Ter-Sarkisov et al. [6] to detect COVID-19 in chest X-ray images. To accomplish the segmentation without balancing the dataset, a prevalent problem in medical image analysis owing to data scarcity and class imbalance, they adopt COVID-CT-MaskNet's approach, a modified version of Mask R-CNN. They outperform the currently used approaches for COVID-19 detection, achieving an accuracy of 91.66% and a sensitivity of 90.80%. The YOLO-v5 technique is applied by Liu et al. [7] to identify lung nodules in CT images. YOLO-v5 is a fast and effective object detection technique that can process images in real time. They achieve multi-scale feature fusion using the BiFPN structure and stochastic pooling to highlight key characteristics in the feature maps, which can improve recognition performance and accuracy. A dual head network (DHN) with two branches, one for nodule recognition and one for nodule classification, has been proposed by Tsai et al. [8]. They demonstrate that, in terms of classification accuracy and recall, the DHN can outperform single head networks.
Worldwide, non-communicable diseases such as cancer pose serious threats to public health. CT scans can identify the presence of malignant tissue, which can help save thousands of lives each year, but only if the disease is caught early and treated. The efficacy of the many methods for diagnosing cancer via image processing can vary depending on the type of malignancy identified, the quality of the images used, and the specific image processing techniques employed. Highly reliable models are essential for the speedy, accurate, and automatic diagnosis of lung cancer. The methodology adopted in this work is based on accurate recognition and classification of lung cancer images, meaning that the nodule and its category can be reliably identified. A sophisticated Mask R-CNN architecture that displays promising results has been established to handle these problems using image processing and deep learning approaches. In brief, the ability to learn defect structures, hyperparameter tuning, and loss function customization are the reasons why the suggested architecture performed better on the considered dataset. Multi-scale feature fusion further improved the model's performance.

MATERIALS AND METHOD

Materials
A collection of digital images is referred to as a dataset, and it is used by developers to develop, test, and evaluate their models and algorithms. The model is created and trained to gain insight from the instances in the dataset. In this study, we utilize the LUNA16 challenge dataset, which focuses on a thorough analysis of automated nodule detection methods on the LIDC-IDRI dataset, a different dataset which comprises lung images with annotated nodules [9]-[12].

Method
Object detection is a procedure that combines computer vision and image processing and aims to recognize and locate objects of interest in images and videos [13]-[18]. For example, in medical images, object detection can be exploited to identify abnormal cells, fractures, and other issues automatically. Cancer detection has two steps: first, target feature extraction; second, classifying and positioning the objects. Assigning a set of class labels to the items of interest in the input image is the main phase in image classification [19]-[20]. Mask R-CNN has numerous aspects that influence its performance [21]-[25], such as the size of the input image, the threshold for intersection over union (IoU) between predicted and ground truth boxes, and the number of regions of interest (RoIs) per image that are proposed by the model. To use Mask R-CNN, the model architecture is built and loaded with pre-trained weights from a dataset such as COCO or ImageNet. Then, customized images are fed to the model for training or inference. The model outputs information about each target that it detects in the image as shown in Figure 1, and Figure 2 recapitulates the algorithm.
1. Start
2. Preprocess the input images
3. Define the backbone network and the region proposal network (RPN)
4. Define the mask head, the boundary head, and the class head
5. Initialize the model using the backbone and the heads
6. Learn the features through training
7. Compute the losses and predict the mask for each input image
8. Perform post-processing and refine detections if necessary
9. End

Figure 2. Pseudocode of the advanced Mask R-CNN algorithm
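The steps of Figure 2 can be sketched as a minimal pipeline skeleton. This is an illustrative pure-Python outline only, not the actual implementation: the class and method names are hypothetical, and in the real model the backbone and heads are deep networks rather than placeholders.

```python
class AdvancedMaskRCNNPipeline:
    """Illustrative skeleton following the nine steps of Figure 2."""

    def preprocess(self, image):
        # Step 2: scale pixel intensities to [0, 1] (a common normalization).
        lo = min(min(row) for row in image)
        hi = max(max(row) for row in image)
        span = (hi - lo) or 1
        return [[(p - lo) / span for p in row] for row in image]

    def build(self):
        # Steps 3-5: define the backbone, the RPN, and the mask/box/class
        # heads, then initialize the model from them (placeholders here).
        self.backbone = "ResNet101+FPN"
        self.heads = ("mask_head", "box_head", "class_head")

    def train_step(self, image, target_mask):
        # Steps 6-7: a real model would run forward/backward passes and
        # return the combined loss; here we return a toy per-pixel error.
        pred = self.preprocess(image)
        n = len(pred) * len(pred[0])
        return sum(abs(p - t) for rp, rt in zip(pred, target_mask)
                   for p, t in zip(rp, rt)) / n
```

Steps 8 and 9 (post-processing such as non-maximum suppression, then termination) would follow inference and are omitted from this sketch.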
A common choice of backbone network is a residual network (ResNet) with a feature pyramid network (FPN), which creates a hierarchy of feature maps at different scales and levels of abstraction. This allows Mask R-CNN to handle objects of varying sizes and shapes. The overall organization of the proposed architecture used in this work is illustrated in Figure 3. The backbone network is the base convolutional network that extracts features from the input image. Mask R-CNN uses backbone networks such as ResNet101 or ResNet with a feature pyramid network (FPN). FPN is a technique that creates a feature pyramid with multiple levels of features at different resolutions and scales. This helps to detect objects of different sizes more accurately. The region proposal network (RPN) is a fully convolutional network that takes the features from the backbone network and generates a set of RoIs that are likely to contain objects. The RPN uses a sliding window over the feature map and applies a 3×3 convolution followed by two sibling 1×1 convolutions: one for predicting the objectness score (how likely the region is to contain an object) and one for predicting the bounding box coordinates (the size and location of the region). The RPN uses anchor boxes with different scales and aspect ratios to generate multiple proposals for each sliding window. In a layer called RoIAlign, the extracted features are aligned with the input RoIs. This addresses a misalignment issue that arises when using RoI pooling (an earlier method used in Faster R-CNN), which quantizes RoIs into discrete grid cells. The mask branch generates K m×m masks for each RoI, one for each of the K classes, where m is the mask resolution (for instance, 28×28). The ground truth class-specific mask is the only one considered during training because the mask branch only creates masks for positive RoIs (those that have an IoU overlap with a ground truth box above a threshold).
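As a concrete illustration of the RPN's anchor scheme described above, the following sketch enumerates the anchor boxes generated at one sliding-window position for a set of scales and aspect ratios. The specific scales and ratios below are illustrative defaults, not the values used in this work.

```python
import math

def make_anchors(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    Each anchor keeps the area scale*scale while its width-to-height
    ratio equals the given aspect ratio, as in the Faster R-CNN RPN.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)   # width grows with the aspect ratio
            h = s / math.sqrt(r)   # height shrinks so the area stays s*s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

With 3 scales and 3 ratios, each sliding-window position proposes 9 anchors, which the two sibling 1×1 convolutions then score and regress.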

Figure 3. Overall organization of the proposed advanced Mask RCNN architecture
After applying a softmax function to each pixel of the K masks, one mask is chosen based on the predicted class to obtain the final result. The proposed architecture components are described in Table 1. The output mask generation can be depicted as shown in Figure 4. The proposed advanced Mask R-CNN applies the following enhancements: i) remove the non-tumor regions, which reduces the effect of unused regions and improves the accuracy of tumor segmentation; ii) apply different feature extraction backbones for Mask R-CNN, such as ResNet101, which affects the convergence of the loss values and the extraction of features from the CT images; iii) use an optimizer for Mask R-CNN, such as SGD, which affects the segmentation accuracy and training time of the model; iv) use a customized loss function that combines both pixel-wise and region-wise losses, which balances the trade-off between precision and recall and optimizes the detection performance; and v) apply image augmentation procedures such as rotation, scaling, flipping, and cropping to increase the diversity and robustness of the training data, which helps overcome the challenges of limited and imbalanced data in the dataset.
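Augmentation step v) can be illustrated with a minimal horizontal-flip transform on an image stored as a nested list. This is only a sketch: a real pipeline would use an image library and would also apply rotation, scaling, and cropping, with the corresponding transforms applied to the masks.

```python
import random

def horizontal_flip(image):
    """Mirror each row of the image left-to-right."""
    return [row[::-1] for row in image]

def augment(image, flip_prob=0.5, rng=random):
    """Randomly flip an image; one of several augmentations one might apply."""
    return horizontal_flip(image) if rng.random() < flip_prob else image
```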

RESULTS AND DISCUSSION
Mask R-CNN is a state-of-the-art model for object detection and segmentation that builds on the Faster R-CNN framework. It adds a branch, parallel to the existing branches for predicting bounding boxes and class labels, that is responsible for generating a binary mask for every detected object. In other words, Mask R-CNN = Faster R-CNN + FCN, where FCN stands for fully convolutional network, a kind of neural network that can process images of any size and output spatial maps of features and predictions. For this work, Mask R-CNN is further fine-tuned on a custom dataset with an 80:20 split, meaning that 80% of the available images are used for training and the remaining 20% for testing. To determine whether an anchor box (a predefined box with a certain shape and size) contains an object or not, an intersection over union (IoU) threshold is used. IoU is a metric that measures the overlap between two regions, such as an anchor box and a ground truth box. If the IoU is above the threshold, the anchor box is considered positive; otherwise, it is negative. Figure 5 shows the AP achieved on the considered dataset. Figure 6 depicts the Dice value scored on the considered dataset. The loss function of Mask R-CNN is an aggregation of three losses: the classification loss, the bounding box regression loss, and the mask loss, as given in (1). The classification loss and bounding box regression loss are the same as in Faster R-CNN. The mask loss is defined as the average binary cross-entropy loss over the pixels of the predicted mask and the ground truth mask for each region of interest and each class. The total loss is the sum of these three losses: i) Lcls: the classification loss, the same as in Faster R-CNN. This is computed using a categorical cross-entropy loss between the predicted and true class labels of the regions of interest (RoIs); ii) Lbox: the bounding box loss, the same as in Faster R-CNN. This is computed as a
smooth L1 loss between the predicted and true coordinates of the bounding boxes; and iii) Lmask: the mask loss, which encodes one binary mask of resolution m × m for each of the K classes. This is computed as a binary cross-entropy loss between the predicted and ground truth pixels of the masks.

L = Lcls + Lbox + Lmask (1)

Figure 7 displays a sample detection of a nodule with the probability of it being a nodule. To measure how well an object detection model performs, we need a metric that takes into account both the accuracy of the classification and the correctness of the localization. One such metric is the mean average precision (mAP), which is based on the average precision (AP) of each object class. AP is computed using the intersection over union (IoU) between the predicted bounding box and the ground truth bounding box. IoU is a ratio that computes how much the two boxes overlap, ranging from 0 (no overlap) to 1 (perfect overlap). A higher IoU means better localization. To determine whether a prediction is correct or not, we need to set a threshold for the IoU. For example, if we set the threshold to 0.5, then any prediction with IoU greater than or equal to 0.5 is considered a true positive (TP), meaning that the model correctly detected and classified the object. Any prediction with IoU less than 0.5 is considered a false positive (FP), meaning that the model either detected a target that did not exist or wrongly classified it. Similarly, any ground truth object that the model failed to detect is considered a false negative (FN), meaning that the model missed an existing object. A true negative (TN) is any portion of the image where there is no object and the model did not predict any.
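The three loss terms summed in (1) can be sketched as follows. The smooth L1 and binary cross-entropy forms match the definitions above; the flat-list inputs and any numeric values are purely illustrative, since in the real model these losses are computed over network outputs.

```python
import math

def smooth_l1(pred, true):
    """Smooth L1 (Huber-like) loss for the box branch, summed over coordinates."""
    total = 0.0
    for p, t in zip(pred, true):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1 else d - 0.5
    return total

def binary_cross_entropy(pred, true, eps=1e-7):
    """Average per-pixel BCE for the mask branch; pred values lie in (0, 1)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, true)) / len(pred)

def total_loss(l_cls, box_pred, box_true, mask_pred, mask_true):
    """Equation (1): L = Lcls + Lbox + Lmask."""
    return (l_cls
            + smooth_l1(box_pred, box_true)
            + binary_cross_entropy(mask_pred, mask_true))
```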
Recall and precision are key metrics derived from TP, FP, FN, and TN. Precision measures how many of the predicted objects are correct, as in (2), while recall measures how many of the actual objects are detected, as given by (3). They are defined by the following formulas:

Precision = TP / (TP + FP) (2)

Recall = TP / (TP + FN) (3)

To compute AP for a given object class, we plot a precision-recall curve for different IoU thresholds and calculate the area under the curve (AUC). The higher the AUC, the better the AP. The F1 score is a measure that combines precision and recall, which are ratios of true positives, false positives, and false negatives, as specified by (4):

F1 = 2 × (precision × recall) / (precision + recall) (4)

In conclusion, the proposed architecture's higher performance on the LUNA16 dataset can be attributed to its capacity to learn abnormality features, hyperparameter adjustment, and a customized loss function. Its performance was further enhanced by multi-scale feature fusion using skip connections. The outcomes demonstrate that the suggested approach outperforms the current state-of-the-art methodologies.

CONCLUSION
Mask R-CNN has applications in various fields, such as computer vision, medical imaging, autonomous driving, and more. Researchers and practitioners keep extending the model for different tasks, including panoptic segmentation, where instances are segmented and labeled by category. Lung cancer, a dangerous condition, is characterized by the exponential proliferation of malignant tissues in the lungs. The ability to save thousands of lives depends heavily on the early detection of lung malignancy. The effectiveness of the various methods for detecting cancer using image processing can vary depending on the form of malignancy being found, the quality of the images used, and the particular image processing procedures employed. The quick, precise, and automatic identification of lung cancer needs highly reliable models. The framework utilized in this effort demonstrates reliable recognition and classification of the nodule. The outcomes in this effort are obtained using Mask R-CNN with a ResNet101 backbone. The values obtained are 99.32% accuracy and 99.45% mean Dice, respectively. Collaboration and data exchange between research organizations and healthcare professionals might speed up progress in lung cancer detection. Future systems for diagnosing lung cancer will rely on the integration of numerous data sources, advancements in deep learning techniques, collaboration, and extensive clinical validation. While Mask R-CNN performed remarkably well, it was not exempt from challenges such as computational complexity, training data requirements, and the trade-off between speed and accuracy.

Figure 7. Nodule detection with probability score

Table 1. Proposed architecture components

Component | Description
Backbone network | A basic convolutional network that analyses the input data and extracts features.
Region proposal network | A fully convolutional network that generates regions of interest (RoIs) that are likely to contain objects.
RoI align | A layer that aligns the extracted features with the input RoIs.
Mask representation | A branch that works in tandem with the classification and bounding-box regression branches to predict binary masks for every RoI.

Table 2. Comparison of general Mask RCNN and proposed advanced Mask RCNN

The F1 score is calculated as the harmonic mean of precision and recall, not the arithmetic mean. It is good when both precision and recall are high, and bad when either of them is low. The F1 score can be weighted to give different importance to precision and recall. The results achieved by the implementation of the general Mask RCNN and the proposed advanced Mask RCNN for lung cancer detection are depicted in Table 2.