Content-based image retrieval based on corel dataset using deep learning

ABSTRACT


INTRODUCTION
Large repositories of multimedia and image content have developed in recent years due to the growing use of digital computers, storage technologies, and digital multimedia. This enormous volume of multimedia data is used in various fields, including digital forensics, electronic games, archaeology, video, satellite data and still image repositories, and medical treatment. This rapid growth has generated a continuous need for image retrieval systems that operate on a large scale [1]. For large image databases, the traditional text-based image retrieval approach is ineffective. Retrieving images based on text has several drawbacks: adding labels to individual images in huge databases is time-consuming, the label text depends on language and suits only one language at a time, and different users can assign different labels to the same image. These drawbacks can be avoided by retrieving images based on their content. This type of image retrieval is known as content-based image retrieval (CBIR) [2].
CBIR has been a popular topic in the multimedia research community since the early 1990s [3]. The main block diagram is shown in Figure 1. CBIR is a key technology in image processing and computer vision. CBIR applications have been developed for various uses, including object recognition, geographic information systems, architectural design, remote sensing [4], surveillance systems, and medical image retrieval [5]. CBIR is a well-defined image search and retrieval technique that uses the visual content of images to find and retrieve images from huge data collections [6]. CBIR searches using low-level features such as texture, shape, and color [7]. This set of low-level features forms a feature vector, which describes the content of each image in the image database. Image retrieval is then based on the similarity of image contents: the similarity between the query feature vector and the feature vectors of the dataset is used to sort the list of matching images [8]. The features used by CBIR may be divided into two groups: global feature descriptors and local feature descriptors. Global features include color [9], shape [10], and texture [11]. Local features include local binary pattern (LBP) [12], oriented FAST and rotated BRIEF (ORB) [13], speeded-up robust features (SURF) [14], scale-invariant feature transform (SIFT) [15], and histogram of oriented gradients (HOG) [16].
Local feature descriptors describe individual image patches, whereas global feature descriptors describe the whole image. The advantage of global feature descriptors is their quick computation, but the disadvantage is their poor precision: they frequently fail to extract significant visual characteristics from an image. Local feature descriptors are more accurate than global feature descriptors because they represent the image using features calculated from its patches. The disadvantage of local features is that they result in a large feature space for large image databases [17]. The feature vectors and similarity measures have the biggest effects on the retrieval performance of a CBIR system. There is always a semantic gap between high-level human perception and the low-level image pixels that systems capture. In light of the recent success of deep learning techniques in computer vision applications, particularly convolutional neural networks (CNN), researchers have turned to them to address this problem and enhance CBIR performance [18].
Srivastava and Khare [19] presented a CBIR technique that combines local and global features. Geometric moments extract global features, while SIFT descriptors extract local features. SIFT and moments are combined to find visually similar images. On Corel 1K, the average precision is 0.3981. Mehmood et al. [20] combine the SURF and HOG image features. After the final features were extracted using the bag-of-visual-words (BoVW) model, they used the Euclidean distance to evaluate the similarity between the query image and the database images. The average precision is 0.8061 on the Corel 1K dataset. These methods have two main drawbacks: first, they require a lot of time, and second, finding the most salient points is not always easy.
Nazir et al. [21] suggest a new CBIR system that combines local and global features to handle low-level information. A color histogram (CH) is used to extract color information. The edge histogram descriptor (EDH) and the discrete wavelet transform (DWT) extract texture features. Based on the experimental results, the suggested method achieves an average precision of 0.735 on Corel 1K.
Pardede et al. [22] suggested a CBIR method that employs a deep CNN for feature extraction from the fully connected layers FC1 and FC2. The feature extractor uses the fully connected feature vectors FV.FC1 and FV.FC2 to extract features from each image, and the study compares the performance of the deep CNN for CBIR tasks with three classifiers: softmax, support vector machine (SVM), and extreme gradient boosting (XGBoost). A deep CNN model was produced based on the suggested neural network structure. The experimental results suggest that, with the XGBoost classifier, the feature extractor from the deep CNN can improve CBIR performance, and that the best feature extractor is FV.FC2. The precision on the Wang dataset (Corel 1K) is 0.69.
Öztürk [23] proposes a CBIR framework in which a dictionary learning method addresses the problem of training with a small amount of labelled data. Dictionary learning (DL) alone cannot produce reliable features for the retrieval task, particularly when the background is complicated. To address both object identification and complicated backgrounds, the system implements a DL technique that uses the feature representation capabilities of a CNN (ResNet-50). When 10 images are retrieved, the mean average precision (mAP) for the modified Corel dataset is 0.855. Öztürk [24] provided a framework for content-based medical image retrieval (CBMIR) based on high-level deep features. The main problem here is the insufficient number of images. To address this issue, a class-driven retrieval strategy is suggested. Hash codes of different lengths are produced using feature reduction methods, and their performance is evaluated. Experiments with the National Electrical Manufacturers Association magnetic resonance imaging (NEMA MRI) and computed tomography (NEMA CT) datasets show that this framework outperforms existing methods in the literature. Desai et al. [25] suggested an effective deep learning architecture for fast image retrieval based on a CNN and an SVM. The SVM is used for classification to reduce the time required to retrieve the results, and VGG16 is used to extract features. VGG16 has 13 convolutional layers, 3 fully connected layers, and a softmax classifier as the last layer. The average precision for retrieving 10 images from the Corel dataset is 0.8361. VGG16 has 138 million parameters, which is a drawback since it can lead to the exploding gradient problem.
This paper makes four main contributions, covering CNN-based deep feature extraction, the choice of CNN architecture and hyperparameters, the comparison of max pooling with average pooling, and the comparison of Euclidean with Manhattan distance. The paper is organized as follows: Section 2 presents the proposed methodology, and Section 3 describes the similarity measurement. Section 4 presents the experimental results and discussion, and Section 5 concludes the paper.

METHOD
In this paper, a CNN is used for feature extraction since it is effective at narrowing the semantic gap between high-level human perception and low-level machine features, as well as at finding the most relevant images and improving retrieval performance. The Corel 1K [26] database, a collection of 1,000 images, was used to validate the results. These images are organized into 10 categories, each of which has 100 images. A block diagram of the proposed method is illustrated in Figure 2. A CNN is a type of neural network model that stacks many layers [27]. CNNs have gained popularity in various image processing applications, including object recognition [28] and classification [29], and have produced promising results. They are increasingly used in image processing tasks such as object classification, face recognition, and gesture identification. According to earlier research, an image can be input directly into a CNN and the resulting features used for image classification [30]. The basic CNN architecture is composed of convolutional layers, pooling layers, fully connected (FC) layers, softmax layers, and non-linear activation functions such as the rectified linear unit (ReLU) [31], [32]. The forward pass begins with a convolution layer, where an activation map is produced by computing the dot product of the filter and the input volume. Next, the ReLU function sets negative values to zero, and pooling downsamples the feature maps. This stage may be repeated any number of times. The final step of the forward pass is the fully connected layer, where the output is produced in vector form. After obtaining the output of the fully connected layer, the softmax values and error values can be computed for the training data to determine which class the output belongs to. CNNs work on image volumes, so the input image can be thought of as the input volume. The volume has three dimensions: width, height, and depth, denoted W, H, and D [33].
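The last step of the forward pass can be illustrated with a minimal sketch (not the authors' code) that turns a vector of fully connected outputs into softmax probabilities and an error value. The function names, the example scores, and the use of cross-entropy as the error measure are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Convert raw fully connected outputs into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Error for one training sample: negative log-probability of its true class
    (an assumed error measure; the text does not name one)."""
    return -math.log(probs[true_class])

# Hypothetical outputs of the last fully connected layer for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted = probs.index(max(probs))  # class with the highest probability
loss = cross_entropy(probs, true_class=0)
print(predicted, [round(p, 3) for p in probs], round(loss, 3))
```

The predicted class is simply the index of the largest softmax value, and the error shrinks as the probability assigned to the true class approaches 1.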
The filter slides over the input volume, starting at the top left and moving to the right, then continuing row by row until it reaches the bottom. Each move from left to right advances by the stride, which is the number of positions the filter shifts per step. ReLU is a fast activation function because it simply sets negative values to 0: when the input is negative, the result is 0, and otherwise the input passes through unchanged [34], as seen in (1).

$$f(x) = \max(0, x) \tag{1}$$

After the convolution procedure, the hidden layer is very large. It is typical to place a pooling or sub-sampling layer right after a convolutional layer to decrease computational complexity. Max pooling and average pooling are the two commonly used types of pooling [35]. Let y = (y_ij) represent the matrix in a pool.
Using the maximum element in y as the output is known as max pooling, as seen in (2).

$$f_{\max}(y) = \max_{i,j} y_{ij} \tag{2}$$

Taking the average of all the element values y_ij is known as average pooling [18], as seen in (3).

$$f_{\mathrm{avg}}(y) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} y_{ij} \tag{3}$$
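The ReLU and pooling operations referenced as (1) to (3) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the 4×4 feature map is made up, and the non-overlapping 2×2 window (stride equal to the window size) is an assumption.

```python
def relu(x):
    """Equation (1): negative inputs become 0, other inputs pass through."""
    return max(0.0, x)

def max_pool(y):
    """Equation (2): the largest element of the pool matrix y."""
    return max(v for row in y for v in row)

def avg_pool(y):
    """Equation (3): the mean of all M x N elements of the pool matrix y."""
    m, n = len(y), len(y[0])
    return sum(v for row in y for v in row) / (m * n)

def pool2d(feature_map, size, reduce_fn):
    """Apply reduce_fn to non-overlapping size x size windows
    (assumption: the stride equals the window size)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            reduce_fn([r[c:c + size] for r in feature_map[i:i + size]])
            for c in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]

# Made-up 4 x 4 activation map containing some negative values.
feature_map = [
    [1.0, -2.0, 0.0, 4.0],
    [3.0, 5.0, -1.0, 2.0],
    [-6.0, 0.0, 2.0, 2.0],
    [1.0, 1.0, 8.0, -4.0],
]
activated = [[relu(v) for v in row] for row in feature_map]
pooled_max = pool2d(activated, 2, max_pool)
pooled_avg = pool2d(activated, 2, avg_pool)
print(pooled_max)
print(pooled_avg)
```

Max pooling keeps only the strongest response in each window, while average pooling blends all responses in the window, which is the difference the two experiments in this paper compare.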
In (3), M and N denote the numbers of rows and columns of the pooled matrix. Two experiments were carried out in this paper, the first with max pooling and the second with average pooling. The proposed CNN architecture comprises 19 layers, as shown in Figure 3, and is used to extract feature vectors from the dataset. The model contains six convolutional layers and six batch normalization layers to normalize the data. Every two convolutional layers are followed by a pooling layer; the model also has two dropout layers for regularization and two fully connected layers. The original images, which are 384×256 or 256×384 pixels in size, are resized to 64×64 pixels before they are fed into the CNN model. In the layers of the CNN model, the filters are scaled up and down so that features can be found. The internal architecture of the CNN model used to train the CBIR system is shown in Table 1. The hyperparameters used in this paper are listed in Table 2. Adam is used as the optimizer.
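As a rough check of this description, the sketch below tallies the layers and tracks the spatial size of a 64×64 input through the network. The count of three pooling layers is inferred from one pooling layer per two of the six convolutional layers. The 3×3 filters with padding 1 and the 2×2 pooling window with stride 2 are assumptions made for illustration; Table 1 remains the authoritative description of the model.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# Layer tally taken from the text; the pooling count is inferred (6 conv / 2).
layers = {
    "convolution": 6,
    "batch_normalization": 6,
    "pooling": 6 // 2,
    "dropout": 2,
    "fully_connected": 2,
}
total_layers = sum(layers.values())

size = 64  # images are resized to 64 x 64 before entering the CNN
sizes = [size]
for _block in range(layers["pooling"]):
    for _conv in range(2):
        # Assumed 3 x 3 filter, stride 1, padding 1: spatial size is preserved.
        size = conv_output_size(size, kernel=3, stride=1, padding=1)
    # Assumed 2 x 2 pooling with stride 2: spatial size is halved.
    size = conv_output_size(size, kernel=2, stride=2)
    sizes.append(size)

print(total_layers, sizes)
```

Under these assumptions the tally agrees with the 19 layers stated in the text, and each pooling stage halves the width and height of the feature maps.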

Image retrieval
The similarity between the feature vector obtained from the query and the feature vectors of the training dataset is determined using two distance measures: Euclidean and Manhattan. As indicated in Algorithm 1, the images with the shortest distances are returned. Evaluation metrics are used to assess the system's performance.

SIMILARITY MEASUREMENT
The Euclidean and Manhattan distances measure the relationship between the feature vector of the query image and the feature vectors of the training dataset images. The dataset image whose distance to the query is the smallest relative to the other distances is the most similar to the query. The two distances are given in (4) and (5), respectively.

$$D_{E} = \sqrt{\sum_{i=1}^{m} \left( Fv_{db}(i) - Fv_{q}(i) \right)^{2}} \tag{4}$$

$$D_{M} = \sum_{i=1}^{m} \left| Fv_{db}(i) - Fv_{q}(i) \right| \tag{5}$$

where m represents the size of the feature vector, and Fvdb and Fvq are the feature vectors of the dataset and query images, respectively.
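The two distance measures referenced as (4) and (5) can be written directly from their definitions. The sketch below is illustrative only: the four-element vectors stand in for the much longer CNN feature vectors, and the image names are hypothetical.

```python
import math

def euclidean(fv_db, fv_q):
    """Equation (4): square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fv_db, fv_q)))

def manhattan(fv_db, fv_q):
    """Equation (5): sum of the absolute differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(fv_db, fv_q))

# Hypothetical query vector and two hypothetical database vectors.
fv_q = [0.0, 3.0, 1.0, 2.0]
database = {
    "img_a": [0.0, 0.0, 5.0, 2.0],
    "img_b": [1.0, 3.0, 1.0, 1.0],
}

# The database image with the smallest distance is the best match.
nearest_euclidean = min(database, key=lambda name: euclidean(database[name], fv_q))
nearest_manhattan = min(database, key=lambda name: manhattan(database[name], fv_q))
print(nearest_euclidean, nearest_manhattan)
```

Both measures rank the same image first in this toy case, but on real feature vectors the rankings can differ, which is why the paper evaluates the two separately.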

Dataset
The Corel 1K dataset [26] has 1,000 JPEG images, each 256×384 or 384×256 pixels in size. The collection consists of 10 categories with 100 images each: Africa, horses, flowers, beaches, buses, buildings, mountains, dinosaurs, elephants, and food. Figure 4 displays an example of each category. Retrieval was made more challenging and robust by keeping all the training images in one folder and all the testing images in another: a training folder with 900 images and a test folder with 100 images, each of which is used as a query image.

Evaluation measurement
Performance in CBIR is assessed using precision and mean average precision (mAP). Precision is determined by dividing the number of relevant images retrieved by the total number of images retrieved. It shows a system's ability to return only relevant images, as seen in (6). The mAP is the mean of the average precision over all classes. For average precision (AP), see (7), and for mAP, see (8).

$$P = \frac{\text{number of relevant images retrieved}}{\text{total number of images retrieved}} \tag{6}$$

$$AP = \frac{1}{n} \sum_{i=1}^{n} P_{i} \tag{7}$$

$$mAP = \frac{1}{m} \sum_{k=1}^{m} AP_{k} \tag{8}$$

where P represents the precision, n is the number of images, AP_k denotes the AP of class k, and m denotes the number of classes.
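The evaluation measures referenced as (6) to (8) can be sketched as follows. The sketch assumes that the AP of a class is the mean of the precision values of the queries belonging to that class; the query results are made up for illustration and are not the paper's data.

```python
from collections import defaultdict

def precision(retrieved_labels, query_label):
    """Equation (6): relevant retrieved images / all retrieved images."""
    relevant = sum(1 for label in retrieved_labels if label == query_label)
    return relevant / len(retrieved_labels)

def average_precision(precisions):
    """Equation (7): mean of the n precision values (assumed: one per query)."""
    return sum(precisions) / len(precisions)

def mean_average_precision(ap_per_class):
    """Equation (8): mean of AP_k over the m classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical results: (query label, labels of the retrieved images).
queries = [
    ("bus", ["bus", "bus", "bus", "horse"]),
    ("bus", ["bus", "bus", "bus", "bus"]),
    ("horse", ["horse", "bus", "horse", "bus"]),
]

precisions_by_class = defaultdict(list)
for query_label, retrieved in queries:
    precisions_by_class[query_label].append(precision(retrieved, query_label))

ap_per_class = {c: average_precision(p) for c, p in precisions_by_class.items()}
map_score = mean_average_precision(ap_per_class)
print(ap_per_class, map_score)
```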

Experiment results
Retrieval performance was assessed using precision. Higher precision means that the system returns more relevant images than irrelevant ones. The precision of each query across all categories was calculated, along with the average precision. The Corel 1K image dataset is used; it contains diverse images ranging from natural scenes to outdoor activities to various animals, making it suitable for testing image retrieval systems. Two retrieval methods are used. The first is based on a CNN with max pooling, while the second is based on a CNN with average pooling and various feature sizes. Several attempts were made to find suitable hyperparameter values, including those shown in Tables 3 and 4, where the learning rate is 0.01 and the number of epochs is 100 and 500, respectively; the hyperparameters in Table 2 were found to fit the architecture proposed in this paper.
The results depend on the nature of the images. Some classes have simple and distinct colors that make distinguishing objects from the background easier. Other classes have similar colors, which makes them difficult to distinguish. In addition, max pooling selects the strongest pixels and largely neglects the weak ones, so it works as an edge detector. Average pooling takes a group of pixels, sums them, and divides by the number of elements in the matrix, so it works as an image smoother. With average pooling, the bus, building, dinosaur, flower, horse, and mountain classes have the highest average precision. With max pooling, the bus, dinosaur, flower, horse, and mountain classes have the highest average precision. Table 5 shows the average precision for Euclidean distance. A feature size of 128 based on the CNN with average pooling achieves higher average precision than max pooling when using Manhattan distance, as seen in Table 6. The best result, as shown in Tables 5 and 6, was achieved with Euclidean distance, a feature size of 256, and 10 retrieved images. Figure 5 compares the average precision measured on the Corel 1K dataset with that of state-of-the-art traditional methods. The images retrieved for a query image using the CNN are illustrated in Figure 6. As a result, compared to methods in related work on the same dataset, this paper's structure and parameters produced better results.
Table 7. Comparison with a state-of-the-art method (entries include the authors in [22], 2019, and the authors in [25])
ISSN: 2252-8938, Int J Artif Intell, Vol. 12, No. 4, December 2023: 1854-1863

CONCLUSION
This paper proposed a CBIR technique that uses a CNN for feature extraction. Two different pooling layers are used to extract features: max and average. The performance of the proposed method was assessed using precision and mAP. Euclidean and Manhattan distance measures are used to compute the distance between the query and database image features. The experimental results on the Corel 1K dataset with Euclidean distance showed an average precision of 0.88 when using average pooling with a feature size of 256 for retrieving the first 10 images. This is a significant improvement over previously proposed methods such as CNN+SVM, SIFT, local and global CH for color features, DWT+EDH for texture features, and BoVW with two feature extractors (HOG and SURF). The proposed method is more accurate than the existing approaches it was compared with, which is a promising result.

Int
image retrieval based on corel dataset using deep learning (Rasha Qassim Hassan) 1855

−
ISSN: 2252-8938 Int J Artif Intell, Vol. 12, No. 4, December 2023: 1854-1863 1856 This paper employed CNN to extract deep features from photos to bridge the semantic gap between highlevel human perception and the low-level picture pixels that computers gather to produce the most relevant images.− The goal is to determine the appropriate CNN architecture while focusing on selecting the best hyperparameter values.− Two experiments are used: the CNN model with max pooling and the CNN model with average pooling.− In similarity measurement phase, two distance measurements were implemented: Euclidean and City Block (Manhattan) to find the best in terms of mAP.

Figure 1. Block diagram of content-based image retrieval

Figure 2. Block diagram of the proposed method

Algorithm 1: Image retrieval
Input: Feature vector of size 128×1×1
Output: Similar images and average precision
− Label the feature vectors of the images for all classes.
− Convert nominal classes to numeric values, e.g. bus is 3 and flower is 6.
− Divide the dataset into 900 training and 100 testing images.
− Compute the distance between the feature vectors of all testing and training images and retrieve the images with the smallest distances.
− Calculate the mean average precision for all testing images (queries).
− Display the most similar images for the query image.
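The steps of Algorithm 1 can be sketched end to end in plain Python. This is a hedged illustration rather than the authors' code: the two-element vectors stand in for the 128-element CNN features, the tiny training and test sets replace the 900/100 split, and the alphabetical zero-based class numbering is an assumption that happens to reproduce the example "bus is 3, flower is 6".

```python
import math
from collections import defaultdict

CATEGORIES = ["africa", "beaches", "buildings", "buses", "dinosaurs",
              "elephants", "flowers", "food", "horses", "mountains"]
# Convert nominal classes to numeric values (assumed: alphabetical, zero-based).
CLASS_ID = {name: idx for idx, name in enumerate(sorted(CATEGORIES))}

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_vec, train_set, k):
    """Return the k training items with the smallest distance to the query."""
    ranked = sorted(train_set, key=lambda item: euclidean(item[0], query_vec))
    return ranked[:k]

def mean_average_precision(train_set, test_set, k):
    """Precision per query, averaged per class, then averaged over classes."""
    per_class = defaultdict(list)
    for query_vec, query_label in test_set:
        hits = retrieve(query_vec, train_set, k)
        relevant = sum(1 for _, label in hits if label == query_label)
        per_class[query_label].append(relevant / k)
    ap = {label: sum(p) / len(p) for label, p in per_class.items()}
    return sum(ap.values()) / len(ap)

bus, flower = CLASS_ID["buses"], CLASS_ID["flowers"]
# Toy labelled feature vectors standing in for the training and test folders.
train_set = [([0.0, 0.0], bus), ([0.1, 0.0], bus),
             ([1.0, 1.0], flower), ([0.9, 1.0], flower)]
test_set = [([0.05, 0.1], bus), ([1.0, 0.9], flower)]

map_score = mean_average_precision(train_set, test_set, k=2)
top_labels = [label for _, label in retrieve(test_set[0][0], train_set, 2)]
print(bus, flower, top_labels, map_score)
```

Swapping `euclidean` for a Manhattan distance function changes only the ranking step, so the same loop can evaluate both measures.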

Figure 4. Examples of each category in the Corel 1K dataset

Table 1. The internal architecture of the CNN model

Table 3. Average precision results when the learning rate is 0.01 and epochs are 100

Table 4. Average precision results when the learning rate is 0.01 and epochs are 500

Table 5. The average precision for each class with different feature vector sizes for average pooling and max pooling based on Euclidean distance

Table 6. The average precision for each class with different feature vector sizes for average pooling and max pooling based on Manhattan distance