Semantic segmentation using fully convolutional networks for unmanned aerial vehicle applications

ABSTRACT


INTRODUCTION
Deep learning models are models whose architecture is made up of layers arranged in stacks. The layers at each stage learn a useful pattern by successively filtering the input data. Figure 1 depicts a basic four-layer neural network for the classification of handwritten digits from the Modified National Institute of Standards and Technology (MNIST) dataset [1]. A system for training a neural network comprises four components: i) layers; ii) input data; iii) a loss function; and iv) an optimizer.
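The four components above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the sizes (784 inputs, 10 classes) and the learning rate are assumptions chosen to match MNIST-shaped data.

```python
import numpy as np

rng = np.random.default_rng(0)

# ii) input data: one flattened 28x28 image and its one-hot label
x = rng.random(784)
y = np.zeros(10); y[3] = 1.0

# i) a layer: a weight matrix, a bias, and a softmax activation
W = rng.normal(scale=0.01, size=(10, 784))
b = np.zeros(10)

def forward(x):
    z = W @ x + b
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

# iii) loss function: cross-entropy between prediction and label
def loss(p, y):
    return -np.sum(y * np.log(p + 1e-12))

# iv) optimizer: one plain gradient-descent step
p = forward(x)
grad_z = p - y                        # dL/dz for softmax + cross-entropy
W -= 0.001 * np.outer(grad_z, x)
b -= 0.001 * grad_z

print(loss(p, y), loss(forward(x), y))   # loss drops after the update
```

A full trainer would repeat the last step over many batches; the single update here is enough to show the roles of the four components.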
Furthermore, neural networks are trained to identify the correct weight values that specify the transformations to be performed on the incoming data [2]. Figure 2 depicts the parameterization of the network layers. This weight configuration is challenging to optimize across varied tasks because there can be millions of interdependent parameters. A loss function guides the search for a weight configuration because it measures how well the layer representations produced by the network match the expected output. The first stage of the backpropagation (BP) algorithm is the determination of the final loss value [3], followed by computation of each parameter's contribution to that loss value, working from the top layer to the bottom layer [4]. Evaluating the derivative of the loss function at a given training point identifies the adjustments required to reduce the loss; iteratively applying such adjustments to optimize the model against a given evaluation metric is known as stochastic gradient descent [5], [6].
Figure 3 shows the final representation of a neural network training loop. The weights are randomly initialized (using a random initializer) in this training loop, which results in a high initial loss score because the initial weights are unlikely to capture the data patterns in a usable manner. The network then adjusts its weights with each training batch and gradually improves until the loss is small [7].
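The loop in Figure 3 can be sketched on a toy problem. This assumes a one-parameter linear model with a mean-squared-error loss, chosen only to make the loss trajectory easy to follow: the randomly initialized weight gives a high first loss, and repeated gradient updates shrink it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(64)
y = 3.0 * x              # target relation the model must discover
w = rng.normal()         # random initializer -> initially poor weight

losses = []
for step in range(100):
    pred = w * x
    losses.append(np.mean((pred - y) ** 2))   # loss score for this pass
    grad = np.mean(2 * (pred - y) * x)        # gradient of the loss w.r.t. w
    w -= 0.5 * grad                           # weight update
print(losses[0], losses[-1])                  # loss falls across the loop
```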

Dense layers
Traditionally, densely connected layers, also known as fully connected layers [8], [9], have been used for datasets other than images; in these layers, every neuron in one layer is connected to every neuron in the next. These dense layers [10], [11] can only learn patterns that span their entire input feature space, whereas each convolution filter is capable of learning patterns locally (within its kernel region of interest) [12]-[14]. This makes it possible to decompose input images into edges and textures for easy learning; dense layers, by contrast, are more useful for classification based on global patterns [15].
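The contrast between full connectivity and local kernels shows up directly in parameter counts. The sizes below (a 28x28 input, 64 units or filters, a 3x3 kernel) are assumptions for illustration only:

```python
# A dense layer connects every input pixel to every unit, so its weight
# matrix spans the whole input feature space; a convolution filter only
# learns a small local kernel that is reused across the image.

h, w = 28, 28          # input image size
units = 64             # dense layer width
k = 3                  # convolution kernel size
filters = 64           # number of convolution filters

dense_params = h * w * units + units      # full connection + biases
conv_params = k * k * filters + filters   # one 3x3 kernel per filter + biases

print(dense_params, conv_params)          # 50240 vs 640
```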

Convolutional neural networks (CNNs)
CNNs are a type of neural network largely utilized in deep learning (DL) for computer vision tasks [16]-[18]. CNNs are similar to traditional neural networks, except that rather than generic matrix multiplication [19], [20], CNNs depend on a convolution kernel found in one or more of the network layers [21], [22]. CNNs can be implemented using image tensors as input because they can extract the image features relevant for classification or differentiation, and then output classifications [23], [24]. CNNs have found wide application because they can use relevant filters to learn translation-invariant patterns and spatial hierarchies. Learning translation-invariant patterns means that after learning a pattern in one region, the network can identify the same pattern anywhere else in the image. This makes image processing efficient, since only a few training samples are needed to learn patterns that generalize [25].
Spatial hierarchies describe how the network's initial layers learn small local patterns that subsequent layers combine into larger patterns. This enhances the capability of the network to acquire abstract and complex visual concepts. It also allows these networks to produce predictions of greater accuracy than simply vectorizing a complicated image with heavy reliance on individual pixels. To capture these key properties and the interactions between features, CNNs should be used in any image classification operation [26].
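Translation invariance can be demonstrated with a toy image and kernel (both assumed for illustration): the same kernel responds identically wherever its pattern appears, so a pattern learned in one region is detected in another without any new parameters.

```python
import numpy as np

# A 3x3 cross-shaped pattern used both as image content and as the kernel.
pattern = np.array([[0, 1, 0],
                    [1, 1, 1],
                    [0, 1, 0]], dtype=float)

img = np.zeros((10, 10))
img[1:4, 1:4] = pattern       # pattern in the top-left region
img[6:9, 5:8] = pattern       # the same pattern elsewhere

def correlate2d(img, kernel):
    # Slide the kernel over the image (no padding, stride 1).
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

resp = correlate2d(img, pattern)
# The response peaks with the same strength at both pattern locations.
print(resp[1, 1], resp[6, 5])
```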

Convolution
Convolution is a procedure that combines two functions of a real-valued parameter to create a new function. Let w(a) represent the weighting function that prioritizes recent measurements, s(t) represent the time-based output estimate function, and x(a) represent the time-based input position function. Based on these defined functions, the generic equation for convolution is

s(t) = (x * w)(t) = ∫ x(a) w(t - a) da (1)

and Figure 4 shows a visual representation of one-dimensional convolution [27]. Convolution layers take feature maps as input and produce feature maps of configurable depths as an output. The output feature map depth, which indicates the number of filters built by the layer, each of which can encode a different input data feature, is one of the two key criteria for creating these convolutions, coupled with the size of the extracted patches [29].
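The discrete counterpart of (1), s(t) = Σ_a x(a) w(t - a), can be checked numerically. The input and weighting values below are assumed toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # input measurements x(a)
w = np.array([0.6, 0.3, 0.1])          # weighting function w(a)

# np.convolve implements s(t) = sum_a x(a) * w(t - a) over all offsets.
s = np.convolve(x, w)
print(s)                               # [0.6 1.5 2.5 1.1 0.3]
```

Each output entry, e.g. s(2) = 1*0.1 + 2*0.3 + 3*0.6 = 2.5, weights the most recent input most heavily, matching the role of w(a) in the text.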
The computation of 2-D convolutions is done by moving a square window of defined width and height across all input feature map pixels. For images, the input feature map has three dimensions: height, width, and color bands (corresponding to red, green, and blue). When these feature maps are subjected to 2-D convolution, 2-D patches of the surrounding features are created. These patches are then turned into 1-D vectors that reflect the output depth using a convolution kernel. After that, the vectors are spatially reconstructed into a 2-D output map that corresponds to all the locations of the input map. Figure 4 depicts the process of this convolution [30].
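The patch-to-vector-to-map process can be sketched directly. The shapes below (an 8x8 three-band input, a 3x3 window, output depth 4) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
h, w, bands = 8, 8, 3                 # height, width, color bands
k, depth = 3, 4                       # window size, output feature depth

img = rng.random((h, w, bands))
# One kernel row per output feature; each row spans a flattened kxk patch.
kernel = rng.random((depth, k * k * bands))

out_h, out_w = h - k + 1, w - k + 1   # no padding
out = np.zeros((out_h, out_w, depth))
for i in range(out_h):
    for j in range(out_w):
        patch = img[i:i+k, j:j+k, :].ravel()   # 2-D patch -> 1-D vector
        out[i, j] = kernel @ patch             # vector -> output depth
print(out.shape)                               # (6, 6, 4)
```

The loop indices spatially reconstruct the per-patch vectors into the 2-D output map described above.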

Strides
The stride of the convolution is a characteristic that determines the size of the output in CNNs. Stride is a convolution operation parameter that specifies the distance between the patches extracted from the input feature map. A stride of 2 implies that the width and height of the output feature map are downsized by a factor of 2 without padding, as shown in Figure 5.
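The downsizing follows the standard output-size formula for unpadded convolution, out = (n - k) // s + 1; the input and kernel sizes below are assumed examples:

```python
# Output-size arithmetic for a strided convolution without padding.
def out_size(n, k, s):
    # n: input size along one axis, k: kernel size, s: stride
    return (n - k) // s + 1

# A 3x3 window over a 28-pixel axis:
print(out_size(28, 3, 1))   # stride 1 -> 26
print(out_size(28, 3, 2))   # stride 2 -> 13 (roughly halved)
```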

Semantic segmentation
Semantic segmentation is an image classification approach in which each pixel in an image is given a class label. For aerial imagery, semantic segmentation is carried out in a feature extractor to locate pertinent areas of an image that may be invariant across seasons and times of day. Even though fully convolutional networks (FCNs) can be used for this task, learning precise image segmentation and localization necessitates a large amount of data. Because of the lengthy time required to manually label satellite images with many classes, few datasets with extensive pixel-by-pixel labels are publicly available. Hence, FCNs and U-Nets are commonly employed in aerial imagery segmentation to address both issues [32].
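"A class label per pixel" means the network emits per-class scores at every pixel, and the label map is the argmax over the class axis, keeping the image's spatial size. The shapes below are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(3)
h, w, n_classes = 4, 5, 3
scores = rng.random((h, w, n_classes))   # per-pixel class scores
labels = scores.argmax(axis=-1)          # one class label per pixel
print(labels.shape)                      # (4, 5) -- same spatial size
```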

Fully convolutional networks (FCN)
The advent of a variant of the CNN known as the FCN represents a big advancement in image segmentation. The difference between the FCN and the traditional CNN is that in the FCN, the fully connected terminal layers are transformed into convolution layers, resulting in the creation of a nonlinear filter for each output vector layer in the network. As a result, the converted network can operate on inputs of any size and produce outputs with corresponding spatial dimensions. Hence, the classification network can generate a heatmap of the selected object class. Adding layers and a spatial loss to the network results in an efficient scheme for end-to-end dense learning. Figure 6 depicts an example of this transformation [33].
FCNs are not only more flexible because they can take a variety of input image sizes, but they have also been shown to be more efficient for learning dense predictions thanks to in-network up-sampling. The FCN also keeps track of the input's spatial information, which is important for semantic segmentation because the task requires both classification and localization. Although an FCN can accept an input image of any size, non-padded convolutions decrease the output resolution. These were chosen to keep filter sizes modest and reduce the computational demands. The outcome is a coarse output with a size reduction equal to the pixel stride of the output units' receptive field [34], [35].
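The core of the dense-to-convolution transformation can be verified numerically: a fully connected layer over a kxk feature map computes exactly what a convolution with a kxk kernel computes at one position, so the converted network can then slide that kernel over larger inputs to produce a heatmap. The sizes below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
k, depth, classes = 4, 8, 5
feat = rng.random((k, k, depth))             # final kxk feature map

W = rng.random((classes, k * k * depth))     # fully connected layer weights

dense_out = W @ feat.ravel()                 # fully connected output

# The same weights viewed as a kxk convolution kernel applied at one
# position give an identical result.
kernel = W.reshape(classes, k, k, depth)
conv_out = np.einsum('cijd,ijd->c', kernel, feat)

print(np.allclose(dense_out, conv_out))      # True: same computation
```

Because the reshaped kernel is just a different view of the dense weights, no retraining is needed for the conversion; larger inputs simply yield more output positions.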

RESULTS AND DISCUSSION
The unmanned aerial vehicle (UAV) system benefits from this model, especially given the limited resources of UAV systems. Fully autonomous UAVs need a detailed understanding of their surrounding environments. The surrounding objects are three-dimensional while the captured pictures are two-dimensional, which is critical for high-level decision-making. Drones need to understand their surrounding environment in real time; therefore, real-time semantic mapping is significant and worth exploring in this type of drone application. Such processing is effective for the most important UAV constraint, the power resources: semantic segmentation reduces the time and power used to process video images or frames in order to reach a final decision [36].

CONCLUSION
Using an FCN has been shown to be not only more flexible, since images with different input sizes can be used, but also more efficient, since the network learns dense predictions through in-network upsampling. The input's spatial information can also be preserved with an FCN, which matters for semantic segmentation applications because the task requires both classification and localization. Any image size can be used as an input with an FCN, even though the output resolution is reduced by convolution without padding; non-padded convolutions were introduced as a way of keeping filter sizes small and reducing the computational demands. The result is a coarse output that is reduced in size by a factor equivalent to the pixel stride of the receptive field of the output units.