Aerial image semantic segmentation based on 3D fits a small dataset of 1D

ABSTRACT


INTRODUCTION
There is currently a scarcity of aerial imaging data with pixel-level annotations for semantic segmentation. This is because labeling many types of objects in high-resolution images is a time-consuming process. This section covers an initial method of manually hand-labeling data, the final datasets that were used, and the necessary preparations [1].
The Semantic Drone Dataset focuses on improving the safety of autonomous drone flight and landing through semantic understanding of urban settings. The imagery depicts more than 20 buildings and was captured from a nadir (bird's-eye) view at heights of 5 to 30 meters above the ground [2]-[4]. Images with a resolution of 6,000×4,000 px were captured with a high-resolution (24 Mpx) camera [5], [6]. The training set contains 400 publicly available photographs, whereas the test set contains 200 private images. The objects in the images are listed in Table 1 and Figure 1.
The lack of adequately annotated aerial images led to the creation of a new dataset as a first step. The Institut für Maschinelles Sehen und Darstellen, Graz, provided this new dataset, which consisted of randomly modified images. The procedure used to obtain the data was as follows: 1,000 images were selected at random from the flight test database, and a Microsoft Surface tablet, together with the Image Labeler from the MATLAB toolbox, was used to label each of these images. The labeling covered three classes: buildings, roads, and clouds. Figure 2 shows an example of one of the many labeled images.

ISSN: 2252-8938  Aerial image semantic segmentation based on 3D fits a small dataset... (Shouket Abdulrahman Ahmed)

In Figure 2(a), a settlement is depicted, which was selected as a test case for evaluating the developed method. The settlement represents a specific area of interest where the method was applied; Figure 2(a) shows its characteristics and features, such as buildings, roads, and other structures. For a more comprehensive analysis, Figure 2(b) presents the processed street view derived from Figure 2(a). This processed image emphasizes the street-level perspective, highlighting the details and elements relevant to the assessment conducted with the developed method. Focusing on the street view gives a clearer understanding of the specific attributes and conditions present within the settlement.

METHOD
This section explains the research chronologically, including the research design, the research procedure (in the form of algorithms or pseudocode), the testing methodology, and the data acquisition [2]-[4]; supporting references are given so that the explanation can be accepted scientifically [5], [6]. Figures 1 and 2 and Table 1 are presented and cited in the manuscript [2], [7]-[12]. The settlement curves produced at SG1 are illustrated in Figure 2(a) and those at SG2 in Figure 2(b).

Pre-processing
The existing computational limits prevent feeding full-sized high-resolution imagery into a CNN; training cannot be done efficiently due to memory constraints [7]-[9]. This necessitates preprocessing the images into smaller, augmented patches that are used to train and test the models [10]. For the final predictions, the predicted patch masks are assembled and the test images are resized back from the patches [11]. The spatial resolutions were also normalized for use in the final integrated buildings dataset. Because 224×224 is a popular semantic segmentation resolution, and given the available computing resources, the input image crops were chosen to have a resolution of 224×224 pixels. Image data generators were built for each step in the epoch for use with Keras, yielding batches of 32 randomly augmented 224×224 crops from the high-quality images. The random augmentations used in this project include horizontal flips, vertical flips, and rotations. The "mirror" fill mode was employed, which fills the areas of the image left vacant by rotation with mirrored image content. Because aerial imagery is not usually homogeneous in orientation, these augmentations let the model learn from diverse perspectives. They also create additional artificial training data that helps the model generalize patterns.
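As an illustration of the patch pipeline described above, the following NumPy sketch (the function name is ours, not from the paper; the actual project used Keras image data generators) yields batches of 32 randomly cropped and flipped 224×224 patches:

```python
import numpy as np

def augmented_patches(image, patch=224, batch=32, rng=None):
    """Yield endless batches of randomly cropped, randomly flipped patches.

    Illustrative sketch of the augmentation described in the text; rotation
    with mirror fill is omitted for brevity.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    while True:
        out = []
        for _ in range(batch):
            y = rng.integers(0, h - patch + 1)   # random top-left corner
            x = rng.integers(0, w - patch + 1)
            crop = image[y:y + patch, x:x + patch]
            if rng.random() < 0.5:               # random horizontal flip
                crop = crop[:, ::-1]
            if rng.random() < 0.5:               # random vertical flip
                crop = crop[::-1, :]
            out.append(crop)
        yield np.stack(out)
```

In the real pipeline, an equivalent Keras `ImageDataGenerator` with `horizontal_flip`, `vertical_flip`, `rotation_range`, and a reflective fill mode would produce the same kind of batches.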

Data pre-processing
The images were subjected to data augmentation before the training step [12]-[14]. The data augmentation was aimed at increasing the dataset's size, improving performance, and strengthening the capacity to generalize patterns. The pixel values were additionally normalized into the range 0 to 1, with the goal of reaching the loss-optimal value via gradient descent in fewer epochs; data normalization enhances learning and generally improves convergence speed. Image flipping can be done vertically, horizontally, or both. However, not all frameworks support vertical flips [15], [16]. A vertical flip is nothing but the rotation of an image by 180º, followed by a horizontal flip. A horizontal flip simply reverses the columns of the image's pixels, while a vertical flip reverses the rows.
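The flip identity stated above (a vertical flip equals a 180º rotation followed by a horizontal flip) can be checked directly with NumPy on a toy array:

```python
import numpy as np

img = np.arange(12).reshape(3, 4)           # toy "image"

vertical = img[::-1, :]                     # vertical flip: reverse the rows
rot180_then_h = np.rot90(img, 2)[:, ::-1]   # rotate 180 deg, then flip horizontally

# The identity from the text holds exactly.
assert np.array_equal(vertical, rot180_then_h)
```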

Cropping and image normalization
Random cropping selects a piece of the original image at random; this part is then resized to the original image size. Image normalization is an important step for training: deep neural networks amount to writing a cost function and optimizing it, so to optimize the cost function in a small number of epochs, normalization or standardization is mandatory, and in the case of images, normalization is the preferred approach [7]-[19].
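A minimal sketch of the two operations just described (helper names are illustrative, not from the paper):

```python
import numpy as np

def random_crop(image, size, rng=None):
    """Select a random square sub-region of the original image."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return image[y:y + size, x:x + size]

def normalize(image):
    """Scale 8-bit pixel values into the [0, 1] range."""
    return image.astype(np.float32) / 255.0
```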

Gradient descent algorithm
This is a search strategy for finding the local minimum of a differentiable function, such as a loss function. The gradient descent algorithm (GDA) is mostly used in machine learning (ML) to determine the parameters/weights of a function that minimize a cost function; the algorithm steps are shown in (1) [20]. Repeat the following until convergence is reached: − calculate the parameter changes as a function of the learning rate, based on the gradient; − use the updated parameter values to recalculate the new gradient; − check the termination criterion, otherwise return to step one. The GDA update of the parameter θ referenced as (1) is:

θ ← θ − α·∇θ J(θ) (1)

where α is the learning rate and J(θ) is the cost function.
As a configurable parameter, the learning rate takes a small positive value, typically between 0.0 and 1.0; it is used to train neural networks (NNs) (note that logistic regression is a NN with just one neuron) [21], [22]. The learning rate thus determines how quickly the model adapts to the problem. Some modifications to the gradient descent algorithm make it workable for large deep neural networks [23], [24]. The GDA has a few drawbacks, especially the number of computations required per iteration. For instance, assume 20,000 data points with 20 features; the sum of squared residuals then contains as many terms as there are data points (20,000 terms in this case), and its derivative must be computed with respect to each feature. This requires 20,000×20=400,000 computations per iteration. Over 1,000 iterations, that amounts to 400,000×1,000=400,000,000 computations to fully run the GDA, which is a considerable overhead; hence, GDA is usually slow on huge datasets, which led to the development of stochastic gradient descent (SGD). The term "stochastic" means "random", and it refers to how data points are selected at each step for the calculation of the derivatives: SGD picks one data point at random from the entire dataset per iteration to minimize the computational requirements [25].
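The per-iteration cost difference can be sketched on the 20,000×20 example above: one full-batch gradient step touches every data point, while one SGD step touches a single randomly chosen point (illustrative NumPy code on a synthetic least-squares problem, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 20))     # 20,000 data points, 20 features
true_w = rng.normal(size=20)
y = X @ true_w                        # synthetic targets

lr = 0.01
w = np.zeros(20)

# Full-batch gradient descent: the gradient uses every data point,
# on the order of 20,000 x 20 = 400,000 multiply-adds per step.
grad = 2 * X.T @ (X @ w - y) / len(X)
w_gd = w - lr * grad

# Stochastic gradient descent: one random point per step,
# so only ~20 multiply-adds for the gradient.
i = rng.integers(len(X))
grad_i = 2 * X[i] * (X[i] @ w - y[i])
w_sgd = w - lr * grad_i
```

One full-batch step reliably lowers the loss but is expensive; one SGD step is cheap but noisy, which is why mini-batches (next subsection) are the usual compromise.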
In stochastic gradient descent only one data point is fed at each iteration, which makes training slow; to speed it up, the data are grouped into batches of, for example, 16, 32, 48, or 64 samples, and training on batches makes the process faster while also reducing the computation [26]-[28]. The Adam optimizer employs a hybrid of two gradient descent techniques. Momentum: this approach speeds up the GDA by using an 'exponentially weighted average' of the gradients; because it uses these averages, the algorithm tends to converge rapidly towards the minima. Root mean squared propagation (RMSprop): proposed by Geoffrey Hinton as a gradient-based NN training approach. The gradients of complex functions such as NNs tend to either vanish or explode as the data propagates through the function. RMSprop was created as a stochastic mini-batch learning algorithm that addresses this by normalizing the gradient with a moving average of squared gradients. This normalization equalizes the step size, lowering it for large gradients to prevent explosion and raising it for small gradients to avoid vanishing. Simply expressed, RMSprop treats the learning rate as an adaptive parameter rather than a hyperparameter; the learning rate thus fluctuates over time.
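The combination Adam uses can be sketched as a single update step; this is an illustrative implementation of the standard Adam rule (momentum as the first moment, RMSprop as the second moment), not code from the paper:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update combining momentum and RMSprop."""
    m = b1 * m + (1 - b1) * grad        # momentum: running average of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # RMSprop: running average of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction for the first moment
    v_hat = v / (1 - b2 ** t)           # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Dividing by the root of the squared-gradient average is exactly the RMSprop normalization described above: large gradients get smaller steps, small gradients get larger ones.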

Convolution layer
Convolutional layers are used to produce a larger receptive field, which allows the model to recognize more features of the input images. Examining every pixel in an image is the simplest way to process it, but this can slow training dramatically; image pixels are sparse, and many zero pixels carry no feature of interest. Convolution is therefore the application of a filter to an image, sweeping it across the image to narrow the underlying content down to features that are more specific and distinctive for the considered object [29].
"A convolutional layer is made up of a series of filters, each of which has its own set of parameters that must be learned. The filter is slid over the input's width and height, and the dot product between the input and the filter is calculated at each spatial position before being fed into an activation function. The convolutional layer's output volume is formed by stacking the activation maps of all filters along the depth dimension" [30].
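The sliding dot product described in the quoted passage can be sketched for a single filter and a single channel (illustrative code, not the paper's implementation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide the filter over the image and take
    the dot product at each spatial position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

With an identity kernel (a 1 at the center, zeros elsewhere) the output simply reproduces the interior of the input, which is a quick sanity check of the sliding-window logic.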
The pooling layer compresses an image by substituting the feature-map output with a summary statistic of nearby outputs. For instance, max-pooling uses the maximum statistic to decide a sliding window's result, leading to a down-sampled feature map with fewer redundant pixels, faster training, and lower memory use. In average pooling, the sliding window's result is computed using the average statistic, so the feature map is averaged before being transferred to the next layer. The use of a pooling layer when building a deep neural network (DNN) can therefore significantly reduce the amount of information conveyed by an image [31].
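Both pooling variants can be sketched with a non-overlapping sliding window (illustrative NumPy code):

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Down-sample a feature map with a non-overlapping window,
    summarizing each window by its max or average statistic."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out
```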
The softmax function is used mostly in problems that involve multiclass classification because of its desirable ability to compress real values into the range [0, 1] while ensuring that the sum of the output probabilities equals 1. The sigmoid activation function likewise compresses all input values into the range [0, 1]: input values > 0 produce results > 0.5, values < 0 produce results < 0.5, and an input of 0 produces a result of exactly 0.5 [32]. The sigmoid function is most commonly used as the final output layer in binary classification problems [33].
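Both activations can be sketched directly (illustrative code):

```python
import numpy as np

def sigmoid(x):
    """Compress a real value into (0, 1): inputs > 0 map above 0.5,
    inputs < 0 map below 0.5, and 0 maps to exactly 0.5."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Map a vector of real scores to probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()
```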
The dropout procedure is simple: adding a dropout operation after a network layer removes a random subset of neurons. The chance of deleting a neuron is determined by the predefined dropout rate p; for instance, if p = 0.5, every neuron has a 0.5 chance of being dropped before feeding the next layer. The more neurons are destroyed, the greater the regularization effect. During the training phase, the dropped neurons' contribution to activation is temporarily eliminated from the feed-forward path, and their weights are not updated on the backward pass. Furthermore, the more neurons are deleted, the fewer effective training signals reach the subsequent layer, which can cause underfitting. In practice, a typical dropout rate p for hidden layers lies between 0.2 and 0.5, while that of the input layer is around 0.2.
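A common way to implement this is "inverted" dropout, which rescales the surviving neurons by 1/(1−p) so their expected activation is unchanged; the sketch below illustrates that variant (not the paper's code):

```python
import numpy as np

def dropout(activations, p, rng=None, training=True):
    """Inverted dropout: zero each neuron with probability p during
    training and rescale the survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return activations       # dropout is disabled at inference time
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```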
Training DNNs is challenging for a variety of reasons, including the fact that the distribution of inputs from prior layers can change as weights are updated. Batch normalization is a network-input standardization approach that can be applied either to the activations of a prior layer or to the inputs directly. It reduces generalization error by speeding up training and providing some regularization, in some cases halving the number of epochs or more.
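A per-feature batch normalization step can be sketched as follows (γ and β are the learnable scale and shift; illustrative code):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize a batch of activations feature-wise to zero mean and
    unit variance, then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```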

CONCLUSION
A model that fits the training data well but performs poorly on unseen data is said to be overfitted. For instance, a 3-D function may be used to fit a small dataset generated by a 1-D function. In deep learning (DL), a model with a large capacity, such as one that learns more function mappings from input to output, can learn too well and overfit the training dataset; this happens mostly when the function makes predictions that fit too closely to the data points. For this research, the first material needed is the unmanned aerial vehicle (UAV) remote sensing image; however, the data acquired from the camera sensor comes with several disadvantages. Hence, some transformations must be applied before the labeling and training phases. Furthermore, a CNN was used to process and extract road information from the UAV images. Owing to the huge number of parameters in the CNN, the result cannot be assured if the model is applied to actual road detection while these parameters are randomly initialized. As a result, the network parameters must be trained before using the CNN in the real world. The initial stage in evaluating the effect of network training is to train and test on manually labeled samples. The application area of CNNs is limited by the enormous number of samples they require, and in this paper, manual sample labeling fell well short of the number of samples required for network training. Studies have previously proven the efficacy of data augmentation in the training of network models. As a result, samples for network training were added from the real scene via data augmentation.

Figure 1. List of objects in images

Table 1. List of objects in images