Segmentation and yield count of an arecanut bunch using deep learning techniques

Arecanut is one of Southeast Asia's most significant commercial crops. This work aims to help arecanut farmers estimate the yield of their orchards. This paper presents deep-learning-based methods for segmenting the arecanut bunch from images and for yield estimation. Segmentation, a fundamental task in any vision-based crop growth monitoring system, is performed using the U2-Net model. The yield of the crop is estimated using Yolov4. Experiments were conducted to measure performance and, since no benchmarks exist for arecanut, the results were compared with benchmark segmentation and yield estimation for other commodities. The U2-Net model achieved a training accuracy of 88% and a validation accuracy of 85%. Yolo shows excellent performance of 94.7% accuracy on segmented images, which compares favourably with similar crops. This is an open access article under the CC BY-SA license.


INTRODUCTION
Agriculture is humanity's primary source of income and a critical element of each country's economy. As a traditional occupation, agriculture is the backbone of the Indian economy. A strong farming industry offers national food security, a source of revenue and job opportunities. Precision agriculture could help with this. Precision agriculture can ease many rising environmental, economic, market and societal problems [1]. Precision agriculture technologies are predicted to positively influence agricultural output on two fronts: profitability for farmers and eco-friendly environmental benefits for the general population. Precision agriculture aims to maximize profit, reduce cost and reduce environmental harm by tailoring agricultural techniques to the location's needs. As a result, agricultural and engineering companies are creating cutting-edge machine vision technology to aid farmers in precision farming. Attention-based farming is a crop supervision method that aims to match the type and quantity of inputs to the actual crop needs for small areas within a farm field. The financial and eco-friendly benefits of precision agriculture can be found in the decreased use of water and fertilizers.

RELATED WORK
Segmentation is a significant step in a machine vision system used for the analysis or interpretation of an image. Its success strongly influences the behaviour of the whole vision system. Image segmentation is a perceptual grouping of pixels based on similarity and proximity [3]. Automated segmentation is essential, as manual segmentation is complicated, time-consuming, subjective and error-prone. Most existing segmentation techniques follow a two-class classification approach, i.e., object and background. Background elimination is the primary step and must be done carefully to avoid misclassification. Segmentation is complex because the color of the crop, shadows, and inter-reflections vary as the illumination changes in an outdoor field. Segmenting immature crops is even more problematic, as they are green and resemble the background foliage. Slight variations among different parts of a single crop bunch further increase the complexity of crop segmentation. Despite these limitations, color-based segmentation has its advantages. Color is the most potent visual cue for discriminating an object from the background. Color is also largely invariant to changes in size, orientation and occlusion [4]. Color-based methods are mostly classified into two categories: pixel-based and region-based methods. The excess green minus excess red index (ExGR) color model, a pixel-based approach, demonstrated better results than other color models for green vegetation segmentation [5].
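As an illustration of the pixel-based approach mentioned above, the following sketch thresholds the ExGR index (excess green minus excess red) computed on chromatic coordinates. This is a generic illustration of the index, not the exact implementation of [5]; the threshold of zero is the conventional choice for this index.

```python
import numpy as np

def exgr_mask(rgb):
    """Segment green vegetation with the ExGR index (ExG - ExR).

    rgb: H x W x 3 float array in [0, 1]. Returns a boolean mask where
    ExGR > 0, the conventional vegetation threshold for this index.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    total = r + g + b + 1e-8                    # avoid division by zero
    r, g, b = r / total, g / total, b / total   # chromatic coordinates
    exg = 2.0 * g - r - b                       # excess green
    exr = 1.4 * r - g                           # excess red
    return (exg - exr) > 0.0
```

A green pixel scores positive on the index while a reddish background pixel scores negative, which is what makes the method attractive for young, green bunches against soil but weak against green foliage.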
Many yield estimation techniques require image segmentation/detection. Examples include mango crop yield estimation using the red, green, blue (RGB) and YCbCr (luma, blue-difference and red-difference chroma) color spaces with texture information based on pixel adjacency [6]; detection of red apples using a hue saturation value (HSV) profile and green apples using a hue saturation intensity (HSI) profile [7]; and threshold-based segmentation of reddish grapes, where the Otsu threshold applied to the H layer of HSV color space gave better results than the histogram and linear color model, Bayesian classifier and Mahalanobis distance methods [8]. Pixel-based methods are simple and efficient but incorporate noise. Researchers therefore focus on region-based strategies, mainly based on edge detection and shape fitting, which include apple segmentation [9], cotton detection [10], maize tassel segmentation [11], and apple identification using thresholding, edge detection, circular Hough transforms, clustering and K-nearest neighbours classification of color and texture features [12]. Citrus detection uses HSV space followed by thresholding and watershed segmentation [13]. Rice grains are segmented by converting the RGB image into Lab color space followed by clustering and graph-cut segmentation [14].
An interactive arecanut bunch segmentation using maximum similarity-based region merging (MSRM) gave better results than thresholding, clustering, and watershed [15]. Arecanut bunch segmentation using active contours [16], the YCgCr color model [17], different color models [18], and deep learning techniques [19] are among the few attempts for the areca crop. The human visual system often combines more than one visual cue to enhance perceptual performance. Segmentation improves when more than one feature is combined [20].

❒ ISSN: 2252-8938
Mango segmentation combining color and texture features can be found in [21]. Objects often have a particular shape, so segmentation based on shape and size is often desired. Attempts in this direction include grape bunch detection using a support vector machine-radial basis function (SVM-RBF) classifier and density-based spatial clustering of applications with noise (DBSCAN), with histograms of oriented gradients (HOG) and local binary pattern (LBP) shape and texture descriptors [22], and segmentation of tomatoes using active contours and the scale-invariant feature transform (SIFT) with shape and position information [23]. Region information is combined with gradient information to approximate the elliptic shape of the tomato boundary. Machine learning (ML) based segmentation approaches include segmenting matured grape bunches by finding edges, then fitting circles and classifying them as background or grapes using support vector machines [24]. Segmentation using such hand-engineered features is less powerful, so investigators have turned to deep learning-based approaches. Although good results were obtained for apple segmentation using multi-scale multi-layered perceptrons (MLP) and convolutional neural networks (CNN), with yield estimation using watershed and circular Hough transform [25], these methods are susceptible to occlusion and illumination. Multiclass (fruit, leaves and branches) almond fruit segmentation using feature learning with a conditional random field (CRF) [26] automatically derives its rules from the data rather than using pre-defined feature descriptors; this unsupervised feature learning approach automatically captures the most appropriate features from the data. Mango counting applies MangoNet, a deep CNN-based model, with a contour-based connected object detection model, which gives better results and is invariant to illumination changes, scaling, contrast and occlusion [27].

SEGMENTATION
This section describes the framework of U2-Net for segmenting the arecanut bunch from the input image while eliminating unwanted background information. The architecture of U2-Net, shown in Figure 1, can be viewed conceptually as an encoder followed by a decoder. Multiple U-Net-like structures can be stacked to build such models, denoted U^n-Net, where n is the number of U-Net units; the challenge is that memory and computation costs increase n times. In this framework, each encoder-decoder stage contains a ReSidual U-block (RSU), itself a down- and up-sampling encoder-decoder. The purpose of this block is to use residual, multi-scale features in place of the original features. This keeps fine-grained detail while forcing the network to derive features at multiple scales within a residual block. For example, En 1 is one RSU block. As illustrated in Figure 1, U2-Net consists of three main parts: a six-stage encoder, a five-stage decoder, and a saliency map fusion module attached to the decoder stages and the last encoder stage.
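The RSU idea described above can be sketched in PyTorch. The block below is a simplified height-4 illustration: a small U-Net whose multi-scale output is added back to an input convolution as a residual. The layer widths and the dilated bottom layer are illustrative choices, not the exact U2-Net configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Module):
    """3x3 convolution + batch norm + ReLU, the basic RSU building unit."""
    def __init__(self, c_in, c_out, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(c_out)
    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class RSU4(nn.Module):
    """Simplified height-4 residual U-block: a tiny encoder-decoder whose
    multi-scale output is added to the input convolution (the residual)."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.conv_in = ConvBNReLU(c_in, c_out)
        self.enc1 = ConvBNReLU(c_out, c_mid)
        self.enc2 = ConvBNReLU(c_mid, c_mid)
        self.enc3 = ConvBNReLU(c_mid, c_mid)
        self.bottom = ConvBNReLU(c_mid, c_mid, dilation=2)  # enlarged receptive field
        self.dec3 = ConvBNReLU(2 * c_mid, c_mid)
        self.dec2 = ConvBNReLU(2 * c_mid, c_mid)
        self.dec1 = ConvBNReLU(2 * c_mid, c_out)
    def forward(self, x):
        xin = self.conv_in(x)
        e1 = self.enc1(xin)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        e3 = self.enc3(F.max_pool2d(e2, 2))
        b = self.bottom(e3)
        d3 = self.dec3(torch.cat([b, e3], dim=1))
        d3 = F.interpolate(d3, size=e2.shape[2:], mode="bilinear", align_corners=False)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        d2 = F.interpolate(d2, size=e1.shape[2:], mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return d1 + xin   # residual connection over multi-scale features
```

The final addition `d1 + xin` is what distinguishes an RSU from a plain nested U-Net: the block learns a multi-scale residual on top of a local feature map.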
Encoders En 1, En 2, En 3 and En 4 use residual blocks RSU-7, RSU-6, RSU-5 and RSU-4, respectively. The digits 7, 6, 5 and 4 represent the height (L) of the RSU blocks and are chosen according to the spatial resolution of the input feature maps: a larger L is used to capture more information from feature maps with larger resolutions. The resolution of the feature maps in En 5 and En 6 is fairly low, and further down-sampling of those feature maps would result in a loss of contextual information. Hence, a dilated version, RSU-4F ("F" denotes the dilated version), is used in both En 5 and En 6, so that all intermediate feature maps of RSU-4F retain the resolution of their input feature maps. The decoder stages mirror their symmetrical encoder stages. In De 5, the dilated version RSU-4F is used and is identical to the one used in En 5 and En 6. The input to each decoder stage is the concatenation of the up-sampled feature maps of its preceding stage with those of its symmetrical encoder stage. Similar to holistically-nested edge detection [28], U2-Net generates six side-output saliency probability maps S(6)side, S(5)side, S(4)side, S(3)side, S(2)side and S(1)side from En 6, De 5, De 4, De 3, De 2 and De 1 using a 3×3 convolution layer and a sigmoid function, and up-samples these maps to the size of the input image. All saliency maps are then fused through a concatenation operation followed by a 1×1 convolution and a sigmoid function to produce the final saliency probability map Sfuse; the weights w(m)side and wfuse of the corresponding loss terms are all set to 1. To summarize, U2-Net is built on RSU blocks with no pre-trained backbones; it permits deep networks with rich multi-scale features at comparatively low computing and memory cost. The network has been trained using the Adam optimizer, with all hyperparameters at their default initial values (weight decay=0, learning rate lr=1e-3, eps=1e-8, betas=(0.9, 0.999)). The model was trained for about 20 hours with a batch size of 12, and the loss converges after 5k iterations. During testing, images are resized to 720×720 and fed to the network to generate saliency maps; the predicted 400×400 saliency maps are resized back to the original input size. Bi-linear interpolation is used for the resizing. The model learning curve concerning dice, Jaccard and loss for training and validation is shown in Figure 3(a), and the accuracy plot is shown in Figure 3(b). A summary of the training performance is depicted in Table 1. The model has achieved training and validation accuracy of 88% and 85%, respectively.
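The reported optimizer settings and test-time resize-predict-resize flow can be sketched as follows. The one-layer stand-in network is purely illustrative; the Adam hyperparameters and the bilinear interpolation are the values stated above.

```python
import torch
import torch.nn.functional as F

# Stand-in for the trained segmentation network (illustrative only).
model = torch.nn.Conv2d(3, 1, 3, padding=1)

# Adam with the default hyperparameters reported in the text.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999),
    eps=1e-8, weight_decay=0)

# Test-time flow: resize the input to 720x720, predict a saliency map,
# then resize the map back to the original resolution bilinearly.
img = torch.rand(1, 3, 480, 640)                 # a hypothetical field image
x = F.interpolate(img, size=(720, 720), mode="bilinear", align_corners=False)
sal = torch.sigmoid(model(x))                    # saliency probabilities in [0, 1]
sal = F.interpolate(sal, size=img.shape[2:], mode="bilinear", align_corners=False)
```

Resizing the probability map rather than a hard mask keeps the bilinear interpolation well-behaved; thresholding can then be applied at the original resolution.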

Performance analysis
Experiments were carried out to determine the segmentation performance using both data sets: ripe and unripe. Four standard measures, namely precision (Pr), recall (Re), F1-score (F1), and intersection over union (IoU), given by (1) to (4), have been used to judge the performance of the model. There have been minimal attempts at arecanut segmentation [16]-[19], and evaluation has been reported for very few images [15], [16]. Table 2 summarizes the test performance of the segmentation; higher values indicate better performance. The suitability of this method is evidenced by its high segmentation performance on both the ripe and unripe data sets, which differ in color. Sample segmentation outputs achieved by the U2-Net model for both ripe and unripe images are shown in Figure 4. The results obtained are better than those of other methods. The model has been implemented using Pytorch 0.4.0. All training and testing were done on an octa-core, 16-thread PC with an Intel(R) Core(TM) i5-10300H CPU at 2.50 GHz, 16 GB RAM and an NVIDIA GeForce RTX 2060 GPU (6 GB memory).

Pr = TP / (TP + FP)   (1)

Re = TP / (TP + FN)   (2)

F1 = (2 × Pr × Re) / (Pr + Re)   (3)

IoU = TP / (TP + FP + FN)   (4)

where TP, FP and FN denote true positives, false positives and false negatives, respectively.
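The four standard measures can be computed directly from binary masks with true-positive, false-positive and false-negative counts, as sketched below:

```python
import numpy as np

def segmentation_scores(pred, gt):
    """Precision, recall, F1-score and IoU from two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # predicted object, truly object
    fp = np.logical_and(pred, ~gt).sum()     # predicted object, truly background
    fn = np.logical_and(~pred, gt).sum()     # missed object pixels
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)
    iou = tp / (tp + fp + fn)
    return pr, re, f1, iou
```

For a whole validation set, the counts would typically be accumulated over all images before computing the ratios, to avoid biasing the average toward small objects.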

YIELD COUNT OF ARECANUT
The objective is to build an object detection model that can count the number of arecanuts in a given image. Object detection methods combine image categorization and object localization: they generate one or more bounding boxes and label each bounding box. These methods can deal with multi-class categorization, localization, and multiple instances of objects. Popular object detectors include RetinaNet, the single-shot multibox detector (SSD) and Fast R-CNN. These methods can address challenges such as data limitation and object identification modelling, but they cannot identify objects in a single algorithm pass. You only look once (YOLO) has gained popularity for its higher performance over other object identification methods. Yolo [29]-[31] merges the classification phase and the region proposal network (RPN) into one network, resulting in a more compact object identification model with lower computational cost, making it suitable for real-time applications. Unlike earlier region-proposal-based detectors [32], [33], which detect objects in two stages, Yolo forecasts the bounding boxes and the corresponding class labels in one pass using a single feed-forward network. Yolov2 [30], the second version of Yolo [29], was presented with the aim of considerably enhancing performance and speed. Faster R-CNN inspired the introduction of anchors for detection in Yolov2; the anchors increase detection performance, reduce challenges, and simplify network training. Batch normalization [34] was meanwhile introduced into the convolution layers, along with skip connections [35], pushing mean average precision (mAP) to 95.14%. The recall and localization performance of Yolov2 was enhanced compared to Yolo. Yolov3 [31] builds on Yolo and Yolov2 and became one of the modern techniques for object identification. Yolov3 uses multi-class classification and binary cross-entropy to calculate the classification loss instead of mean
square error. As shown in Figure 5, Yolov3 [36] uses logistic regression to predict objects at three distinct scales (similar to a feature pyramid network, FPN) and a score for each bounding box. DarkNet-53 replaces DarkNet-19 as the new feature extractor. DarkNet-53 is a chain of 53 convolutional layers of size 1×1 followed by filters of size 3×3, with skip connections. Compared to ResNet-152, DarkNet-53 requires fewer billion floating point operations (BFLOPs), yet it is two times faster with classification performance comparable to that of ResNet-152. Yolov3 improves significantly on small-object detection performance. Bochkovskiy et al. [37] recently introduced Yolov4, the next version of Yolov3. With comparable performance, it runs twice as fast as EfficientDet. In Yolov4, the average precision and frames per second were enhanced by 10% and 12%, respectively. The CSPDarkNet53 backbone, a spatial pyramid pooling (SPP) extra block [38], a path aggregation network (PANet) neck [39], and the Yolov3 head make up the Yolov4 framework. With Mish [40], CSPDarkNet53 improves the CNN's learning capacity. SPP is used with CSPDarkNet53 to considerably extend the receptive field and separate the most relevant context features, with almost no reduction in network operating speed. Instead of the FPN in Yolov3, PANet is used in Yolov4 to aggregate feature maps from various stages. Yolov4 allows broader use of conventional GPUs while enhancing the performance of the classifier and detector. This study uses the label what you see (LWYS) procedure to recognize arecas in complicated environments using a modified Yolov3 model named the Yolo-areca model. The addition of a dense architecture [41] into Yolov3 to assist the reuse of attributes for more generalized areca identification, and the application of SPP to lower the error and increase the accuracy, are among the ideas put forward to reduce the disadvantages of deep learning and make detectors as intelligent as humans.
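The SPP block mentioned above can be sketched in PyTorch as follows: the same feature map is max-pooled at several kernel sizes with stride 1 and concatenated along the channel axis, which enlarges the receptive field without changing the spatial size. The kernel sizes (5, 9, 13) are the common Yolo choices and are an assumption here, not a value stated in the paper.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Yolo-style spatial pyramid pooling: parallel max-pools of
    increasing kernel size, concatenated with the input features."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride 1 and padding k // 2 keep the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```

With three pooling branches, the output has four times the input channels; a following 1×1 convolution normally reduces this back to the desired width.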

Yolo-areca model
The Yolo-areca model replaced the 8*256 and 8*512 blocks of Yolov3, shown in Figure 5, with a dense architecture for enhanced feature reuse and characterization. One convolutional layer and a 1*1 bottleneck layer were put together for every dense layer to enable accurate detection of tiny arecas in various settings. A leaky rectified linear unit (ReLU) [42] with FDL*3 was used to activate the yield count of segmented images. The layers of Yolov3 were pruned as follows: yield count in unsegmented images is triggered using Mish [40] with FDL*1; yield count in segmented images is activated using Mish with FDL*3 and SPP [38]. Mish, defined as f(x) = x·tanh(ς(x)), where ς(x) = ln(1 + e^x) is the softplus activation function, was found to outperform ReLU; introducing this activation function can significantly improve the performance of deep neural networks. SPP [38] was added after the last residual block to optimise the network topology. The feature extraction capability is strengthened as the convolutional layers deepen and the receptive field of a neuron increases. However, if the shape feature map of the arecas is obscured, the location details of small arecas become erroneous or are even lost in some cases [43]. With more arecas in the image, there will be missed detections and lower accuracy; the SPP module can resolve this issue. The model was trained and tested on the following configuration: Intel Xeon(R) 64-bit 2.3 GHz CPU, 16 GB RAM, NVIDIA Tesla T4 GPU, CUDA v11.2, cuDNN v7.6.5. Images with a resolution of 416×416 pixels are fed into the model. Training loss is reduced when the learning rate is adjusted [30]. The learning rate was set to 0.001 for 4000 iterations with a maximum batch size of 6000 ripe and unripe areca images. Batch and subdivision were set to 32 and 16, respectively, to lower memory usage. Momentum and decay rates were set to 0.949 and 0.0005, respectively. Yolov4 was trained using pre-trained weights.
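As a minimal illustration, Mish can be implemented directly from its definition f(x) = x·tanh(ς(x)) with ς(x) = ln(1 + e^x), using a numerically stable form of softplus:

```python
import math

def softplus(x):
    """Numerically stable softplus: ln(1 + e^x) without overflow."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x):
    """Mish activation: f(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))
```

Unlike ReLU, Mish is smooth everywhere and allows a small negative output for negative inputs, which is credited with better gradient flow in deep detectors.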

Training
The model has been trained separately using 1017 segmented and 1017 unsegmented images; 80% were used for training, and the remaining 20% were used for validation. All images are resized to 1920×1080 resolution. The label for each class and the coordinates of all ground-truth bounding boxes are required for training [29]-[31]. All ground-truth bounding boxes were labelled using a graphical image annotation tool. Each areca in an image is labelled with a bounding box using the LWYS approach shown in Figure 6. In particular, the bounding boxes for heavily occluded arecas were drawn with a presumed shape using human judgement. Three people verified the labelled images to confirm that the annotations were done correctly. Four standard measures, namely precision (Pr), recall (Re), F1-score (F1) and IoU, given by (1) to (4), have been used to judge the performance of the model. Table 3 summarizes the training performance.
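Yolo training files store each ground-truth box as a normalized `class cx cy w h` line [29]-[31]. A minimal conversion from pixel-space corner coordinates can be sketched as follows (the function name is illustrative):

```python
def to_yolo_label(cls_id, box, img_w, img_h):
    """Convert a pixel-space (xmin, ymin, xmax, ymax) box into the
    normalized '<class> <cx> <cy> <w> <h>' line used by Yolo label files."""
    xmin, ymin, xmax, ymax = box
    cx = (xmin + xmax) / 2.0 / img_w   # box centre, normalized to [0, 1]
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w          # box size, normalized to [0, 1]
    h = (ymax - ymin) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

One such line per areca, written to a text file that shares its name with the image, is what annotation tools typically export for Yolo training.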

Results and comparison
The goal is to present an efficient and accurate technique for counting arecanuts in a bunch from an image acquired in field conditions. The trained model has been tested, and the mean absolute percentage error of the predicted counts has been evaluated.

Training
Deep learning-based techniques need vast data sets and proper labelling, and hence longer training and shorter testing times than other machine learning-based methods. It is not feasible to collect enormous amounts of data for training. The training data size can be increased using a technique called data augmentation, which applies transformations to the database images; the ImageDataGenerator class of the Keras library has been used for this. The labelling task is difficult and requires experts to annotate input images. Most current annotation tools are based on a polygonal approximation of the object boundaries. The object boundary in each input image is encoded as a mask, a set of polygon points. Annotations were completed using the Labelme tool, and each annotation was stored as a JSON file. The JSON files were converted to binary images using labelme2voc, with white representing the arecanut area and black the background. A sample input image, its labelling and its mask are shown in Figure 2.
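The polygon-to-binary-mask step described above can be sketched as follows, using Pillow's polygon rasterizer as a simplified stand-in for labelme2voc: white (255) marks the arecanut area on a black background.

```python
import numpy as np
from PIL import Image, ImageDraw

def polygons_to_mask(polygons, width, height):
    """Rasterize Labelme-style polygons into a binary mask.

    polygons: list of point lists, each point an (x, y) pair.
    Returns a height x width uint8 array: 255 inside, 0 outside.
    """
    mask = Image.new("L", (width, height), 0)   # black background
    draw = ImageDraw.Draw(mask)
    for pts in polygons:
        draw.polygon([tuple(p) for p in pts], fill=255)  # white object area
    return np.array(mask)
```

In practice the point lists would be read from each annotation's JSON file (`shapes[i]["points"]` in the Labelme format) before rasterization.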

Figure 3. Model performance: (a) learning curve and (b) accuracy plot

Figure 4. Representative results of segmentation

Figure 5. Architecture of Yolov3

Higher values of the above measures indicate a better model. The model performs better for segmented images because most of the unwanted background information has been eliminated during segmentation.

Table 1. Segmentation performance calculated over the validation set

Table 2. Segmentation performance calculated over the test set

The 26*26*768 and 13*13*384 feature maps in the FPN of the Yolo-areca model increase to 26*26*2816 and 13*13*1408 features. The Yolo-areca model was trained separately with segmented and unsegmented images to find a more accurate and faster real-time Yolo-areca detection model.

Table 3. Training performance