A study on attention-based deep learning architecture model for image captioning

ABSTRACT


INTRODUCTION
Image captioning is the ability to describe the contents of an image in the form of sentences [1]. This ability requires methods from two fields of artificial intelligence: computer vision, to understand the content of a given image, and natural language processing (NLP), to convert that content into sentence form. Due to the advancement of deep learning models in these two fields, image captioning has received a lot of attention in recent years. On the computer vision side, improvements in convolutional neural network (CNN) architectures and object detection have contributed to better image captioning systems. On the NLP side, more advanced sequential models, such as attention-based recurrent networks, result in more accurate text generation.
Most successful image captioning systems use an encoder-decoder approach inspired by the sequence-to-sequence model for machine translation. This framework uses a CNN + recurrent neural network (RNN) combination, where the CNN acts as an image encoder that extracts region-based visual features from the input image and the RNN acts as a caption decoder that generates sentences [2]. With the development of machine translation, a new architecture emerged, namely attention, which has become a state-of-the-art approach in the NLP field and has shaped image captioning models. Transformer [3] is an attention-based architecture that is being widely adopted in the field.
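To make the framework concrete, the following minimal sketch (written for this review, not taken from any of the surveyed works) shows how a CNN encoder and an RNN (LSTM) decoder are typically wired together; the backbone, layer sizes, and module names are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encodes an image into a single feature vector using a CNN backbone."""
    def __init__(self, embed_dim):
        super().__init__()
        backbone = models.resnet50(weights=None)               # any CNN backbone can be used
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                                  # images: (B, 3, H, W)
        feats = self.features(images).flatten(1)                # (B, 2048) pooled CNN features
        return self.fc(feats)                                   # (B, embed_dim)

class RNNDecoder(nn.Module):
    """Generates a caption word by word, conditioned on the encoded image feature."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):                    # captions: (B, T) token ids
        # The image feature is fed as the first step of the sequence, then the caption tokens.
        inputs = torch.cat([image_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                 # (B, T+1, vocab_size) word scores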
This study aims to survey state-of-the-art methods in image captioning, especially attention-based architectures, including the transformer. We also take Indonesian, as one of the low-resource languages, as a case study to give a picture of how mature the development of image captioning models is in such a language. An analysis of the models and architectures used, as well as of the results of the studies, was carried out. Additionally, we formulate a general concept that indicates what needs to be done and what techniques and data are typically used in image captioning research. The results of this study are expected to be a reference for image captioning methods in future research and to provide recommendations for the best methods for image captioning.

METHOD

Research flow
The systematic literature review presented in this paper has two main steps in its methodology. The first step is to formulate and conduct a search for related literature. The second step is to formulate, conduct, and discuss the literature analysis from several points of view. Details of the methodological steps carried out are explained in the next sections.

Searching method
The search for related literature was carried out in three parts, with the help of the Google Scholar search engine. The search selects literature that meets the following criteria: i) the publication uses an attention mechanism model as the basis for image captioning; and ii) the publication provides details on the implementation of the method for the image captioning task.
The first part of the search uses three main keywords: image captioning, transformer models, and attention mechanism. The publications taken in this part fall within the time frame between 2017 and 2020. This time frame is taken to depict the growth and development of the attention-based mechanism in the image captioning task. The second part of the search is carried out with the same keywords as the first part, but the time frame is between 2021 and 2022. In this part, only the top four cited open-access publications are taken. The aim of the second part is to gain some insight into how attention-based models in image captioning can still be improved. The last part is slightly different from the previous parts. The third part focuses on studies of image captioning in Indonesian, using Indonesian image captioning as the keyword. This part intends to find studies from 2017 to 2022 that use Indonesian-language datasets and attention-based models, including the transformer.

Analytical method
The selected literature is analyzed from several points of view. Five analyses are carried out, namely architecture analysis, method analysis, dataset analysis, evaluation metric analysis, and analysis of the general modeling of image captioning. The analysis is carried out with the aim of obtaining information that can support the growth of research in this area. The results of the analysis are presented in tabular form with a detailed explanation.

RESULTS AND DISCUSSION
Using the keywords given in the literature search, 27 works were obtained in the first step. Of these 27 works, 25 met the search criteria and two did not: one falls outside the specified 2017 to 2020 time range and the other does not use an attention mechanism model. In the second search, we selected four works from 2021 and four from 2022 that appear on the first five pages of the search results and can be accessed. Our third search resulted in four works that match our criteria.
Table 1 (see in Appendix) shows the comparison between the selected works according to the search criteria. The table provides a comparison between the architectures, the methods, and the evaluation metric results. The evaluation metrics presented are bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ordering (METEOR), recall-oriented understudy for gisting evaluation-longest common subsequence (ROUGE-L), consensus-based image description evaluation (CIDEr), and semantic propositional image caption evaluation (SPICE). The higher the score, the more accurate the predicted caption with respect to the reference captions. The elaboration of the architecture, dataset, and evaluation metric analyses is presented in the following sections. The literature comparison for papers that used Indonesian datasets is presented in Table 2.

Architecture analysis
In this section, different approaches to developing image captioning models are elaborated. The analysis is grouped into two parts. The first covers the vanilla attention mechanism, and the second covers the transformer, which uses a more specific attention mechanism called self-attention.

Attention mechanism
Of the image captioning methods that use the attention mechanism, many use a CNN encoder and an RNN or long short-term memory (LSTM) decoder. Among them is the hierarchical attention network (HAN) [7], which attends to semantic features at various levels, making it easy to predict different words based on different features, while its multivariate residual module (MRM) makes it easy to extract relevant relations from various features. Other methods that use the same encoder and decoder include attention on attention (AoA) [8], the scene graph auto-encoder (SGAE) [12], adaptive attention through a visual sentinel [20], policy gradient optimization of SPIDEr [22], and the recurrent fusion network (RFNet) [15]. In addition, there is research that utilizes spatial and semantic information using attention mechanisms, namely the graph convolutional network-long short-term memory (GCN-LSTM) [17] and spatial and channel-wise attention in a CNN (SCA-CNN) [21], which score high when compared to state-of-the-art models. LSTM-P [5] uses an RNN-based language model and presents a novelty, namely the exploitation of a pointer mechanism to accommodate dynamic word generation through the language model and word copying from the objects being learned. Hierarchy parsing (HIP) [6] integrates a hierarchical structure into the image encoder; HIP functions as a feature refiner, producing a rich and multi-level representation of the image. Unsupervised image captioning [16] uses an encoder, a generator, and a discriminator: a CNN encodes the input image into a feature representation, an LSTM generator decodes the image representation into a sentence that describes the image content, and an LSTM discriminator is tasked with distinguishing original sentences from sentences generated by the model. The combination of top-down and bottom-up attention mechanisms [19] uses a bottom-up module based on Faster R-CNN to process image regions and convert them into feature vectors, while the top-down module determines the feature weighting. Research [23] overcomes the variation and ambiguity of image descriptions with a convolutional technique. Research on extending the LSTM architecture with attributes (LSTM-A) [25] has succeeded in integrating attributes into the CNN + RNN image captioning framework. Having proven performance in generating meaningful sentences and being very successful in advancing the state of the art, the attention mechanism is still being used in recent research [33]. Prophet attention [33] was introduced to calculate ideal attention weights over image regions by using future information.
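Despite their differences, the vanilla attention models above share the same core computation: at each decoding step, the decoder state is used to weight region features and form a context vector. The sketch below is a generic additive ("soft") attention module written for illustration only; it is not the implementation of any specific paper, and all names and dimensions are assumptions.

import torch
import torch.nn as nn

class SoftVisualAttention(nn.Module):
    """Generic additive attention over image regions (illustrative sketch)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)       # projects region features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # projects the decoder state
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, R, feat_dim) region features; hidden: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.feat_proj(regions)
                                  + self.hidden_proj(hidden).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(e, dim=1)                       # attention weights over regions
        context = (alpha * regions).sum(dim=1)                # (B, feat_dim) attended context
        return context, alpha.squeeze(-1)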
According to our search, attention has rarely been used in Indonesian image captioning since 2017, as we only found two papers that matched our criteria. To produce the next word, adaptive attention [34] was used to determine when and on which part of the image the model should focus, using translated Microsoft Common Objects in Context (MS COCO) and Flickr30k datasets. Research [35] applied a visual attention mechanism to their model to produce captions that make greater sense. As a result, their model was able to give sensible and detailed captions in the local tourism domain.

Transformer
The transformer works as an encoder-decoder architecture that uses an attention mechanism. The captioning transformer (CT) study [18] was developed to overcome problems with image captioning models that are typically built on an LSTM decoder. Although good at remembering sequences, the LSTM processes them step by step, which complicates training in terms of time. CT only has an attention module without a time dependency, so this model not only captures sequence dependencies, but can also be trained in parallel.
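The difference can be seen in a minimal sketch of scaled dot-product self-attention, the building block of the transformer: every position attends to every other position in one matrix operation, so no recurrent time step is needed (illustrative code, with assumed dimensions).

import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a whole sequence at once (no recurrence)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # x: (T, d_model) region or word features
    scores = q @ k.T / math.sqrt(k.shape[-1])        # (T, T) pairwise similarity scores
    weights = torch.softmax(scores, dim=-1)          # attention distribution for each position
    return weights @ v                               # (T, d_v) contextualized features

# Example: 10 "tokens" (e.g., region features) of width 64, processed in parallel
x = torch.randn(10, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # shape (10, 64)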
Research [1] uses a spatial graph encoder transformer layer, with a modified encoder transformer arrangement, and an implicit decoder transformer layer, which contains a decoder layer and an LSTM layer, to handle the structure among the semantic units of an image and among the words of a sentence. Research [4] improves image encoding and text prediction by using a meshed transformer with memory to obtain low- and high-level features, so that it can describe images that do not even appear in the training data. The multimodal transformer [2] is composed of an image encoder and a text decoder that simultaneously capture intra- and inter-modal interactions, such as the relationships among words and objects in the attention blocks, and can produce captions accurately. The boosted transformer [9] utilizes semantic concepts (CGA) and visual features (VGA) to enhance the generated image descriptions. Personality-Captions [13] uses TransResNet and a dataset that supports personality differentiation to produce image descriptions that are closer to human ones. The Conceptual Captions dataset [14] was also developed, using Inception-ResNetv2 as a feature extractor and a transformer to perform image captioning. Another study with a transformer encoder and decoder, using an object spatial relationship model [10], was built by explicitly including spatial relationship information between detected input objects through geometric attention. The entangled transformer [11] was developed to exploit both semantic and spatial information from images.
Since various transformer-based models have achieved promising success on the image captioning task [31], recent research still widely uses them. The dual-level collaborative transformer [26] was proposed to make region and grid features complement each other for image captioning by applying intra-level fusion via comprehensive relation attention (CRA) and dual-way self-attention (DWSA). The global enhanced transformer (GET) [27] makes it possible to obtain a more comprehensive global representation, which guides the decoder in creating a high-quality caption. The caption transformer (CPTR) [28], as a full transformer model, is capable of modeling global context information at every encoder layer. A transformer-based semi-autoregressive model for image captioning, which keeps the autoregressive property globally and the non-autoregressive property locally, tackles the heavy inference latency caused by adopting autoregressive decoders [29]. The spatial- and scale-aware transformer (S² transformer) [30] explores both low-level and high-level encoded features simultaneously in a scale-wise reinforcement module and learns pseudo regions by learning clusters in a spatial-aware pseudo-supervised module. The relational transformer (ReFormer) [31] was proposed to improve the quality of image captions by generating features with embedded relation information, as well as explicitly expressing pair-wise relationships between images and their objects. Research [32] used a transformer-based architecture called the attention-reinforced transformer to overcome the problem of cross entropy limiting diversity in image captioning.
For research with Indonesian datasets, we only found two papers that use the transformer in their study, [36] and [37]. The results of research [37] showed that the implementation of the transformer architecture significantly exceeded the results of existing Indonesian image captioning research. In addition, the use of the EfficientNet model obtains better results than InceptionV3. Research [36] takes a different approach, using the ResNet family as the base for visual feature extraction.

Dataset analysis
From the analysis conducted on existing studies, five main datasets are generally used for image captioning, namely Conceptual Captions, MS COCO, Flickr8K, Flickr30K, and a specially made dataset for the local tourism domain. The composition of these datasets and the studies that utilize them can be seen in Table 3. In Table 3, we can see the detailed number of images in the training, validation, and testing splits. Furthermore, we can see the number of annotations for each image in the different datasets.

Conceptual captions
Conceptual Captions [14] was created using the Flume pipeline. This pipeline processes billions of internet pages in parallel. From these web pages, extraction, filtering, and (image, description) pairing processes are carried out. The filtering and processing steps are divided into four, namely image-based filtering, text-based filtering, image and text-based filtering, and text transformation with hypernymization.

MS COCO
MS COCO [38] is a large dataset from Microsoft containing object detection, segmentation, and captioning annotations. This dataset has 328,000 images with a total of 2,500,000 labeled instances over 91 object categories, 80 of which are commonly used with segmentation annotations. The dataset is divided into training, validation, and testing data, as in Table 3, using Karpathy's splits [39].
The image caption data in MS COCO consist of two collections. The first, MS COCO c5, has five reference captions for each image in the MS COCO training, validation, and testing sets. The second, MS COCO c40, has 40 reference captions for 5,000 images randomly selected from the MS COCO testing set. MS COCO c40 builds on the observation that many automatic evaluation metrics achieve higher correlation with human judgment when given more references [40].

Flickr30K and Flickr8K
Flickr30K [41] is a popular dataset used as a benchmark for text generation and retrieval. Flickr30K has 31,783 images focused on humans and animals, with a total of 158,915 English captions as references for these images. Flickr8K [42] has a total of 8,000 images collected from Flickr, each with five human-annotated captions.

Indonesian datasets
To translate the English MS COCO dataset into Indonesian, two methods were used: Google Translate [36], [37] and manual translation [34]. However, the results of the Google Translate translation are not very good, so research [34] used both methods in their study to obtain a good Indonesian dataset. A study in a specific domain requires a specially made dataset when none is available; research [35] collected a total of 1,696 local tourism-related images from the Google search engine.

Evaluation metrics analysis
From the analyzed literature, five different evaluation metrics are commonly used to evaluate image captioning: BLEU, METEOR, ROUGE, CIDEr, and SPICE. These metrics mainly measure the similarity between generated and reference captions through word overlap. SPICE, in particular, uses a "scene graph" to measure the similarity [43]. BLEU, METEOR, and ROUGE-L are evaluation measures originally developed to assess the performance of machine translation. In recent years, CIDEr and SPICE were developed specifically to evaluate image captions and showed more success than the previous ones [44].
The evaluation measures the quality of a candidate caption $c_i$ given a set of reference captions $S_i = \{s_{i1}, s_{i2}, \ldots, s_{im}\} \in S$. The sentences are represented using sets of n-grams, where an n-gram $\omega_k \in \Omega$ is a set of one or more ordered words. For a candidate sentence $c_i \in C$, $h_k(s_{ij})$ or $h_k(c_i)$ denotes the number of times the n-gram $\omega_k$ occurs in the sentence $s_{ij}$ or $c_i$, respectively.
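As a small illustration of this notation (our own example, not taken from the surveyed papers), the count $h_k(\cdot)$ can be computed by enumerating the n-grams of a tokenized sentence:

from collections import Counter

def ngram_counts(tokens, n):
    """h_k: how often each n-gram (an ordered tuple of words) occurs in a sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

caption = "a dog runs on the beach".split()
print(ngram_counts(caption, 1))   # unigram counts
print(ngram_counts(caption, 2))   # bigram counts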
BLEU evaluates a candidate text by calculating its n-gram overlap with the reference texts. The BLEU score is the geometric mean of the modified n-gram precision scores, multiplied by a brevity penalty factor that "punishes" short sentences so that the evaluation results are more representative [44]. The clipped n-gram precision $CP_n(C, S)$ between sentences is computed at the corpus level as shown in (1) [40], where $k$ indexes the set of possible n-grams of length $n$. Since $CP_n$ is a precision score, it favors short sentences, so a brevity penalty $b(C, S)$, as in (2), is also used. Here, $l_C$ represents the total length of the candidate sentences $c_i$ and $l_S$ represents the corpus-level effective reference length; whenever multiple references exist for a candidate sentence, the closest reference length is used for the brevity penalty. The overall BLEU score is a weighted geometric mean of the individual n-gram precisions, as in (3), where $n$ takes the values 1, 2, 3, 4 and $w_n$ is usually held constant for all $n$.
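For reference, the corpus-level definitions referred to as (1)-(3) can be written, following the standard formulation in [40], as:

\[ CP_n(C,S) = \frac{\sum_i \sum_k \min\big(h_k(c_i),\ \max_j h_k(s_{ij})\big)}{\sum_i \sum_k h_k(c_i)} \tag{1} \]

\[ b(C,S) = \begin{cases} 1 & \text{if } l_C > l_S \\ e^{1 - l_S/l_C} & \text{if } l_C \le l_S \end{cases} \tag{2} \]

\[ BLEU_N(C,S) = b(C,S)\,\exp\!\Big(\sum_{n=1}^{N} w_n \log CP_n(C,S)\Big) \tag{3} \]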
The metric for evaluation of translation with explicit ordering (METEOR) evaluates a candidate text based on overlapping unigrams between the candidate and reference texts, where unigrams are matched on their meanings, exact forms, and stemmed forms [44]. When calculating the alignment between the words in the candidate and reference sentences, the number of contiguous and identically ordered chunks of tokens in the sentence pair ($ch$) is minimized. The evaluation is conducted using the default parameters $\alpha$, $\gamma$, and $\theta$. Based on a set of alignments $m$, the METEOR score is derived from the harmonic mean of precision ($P_m$) and recall ($R_m$) between the candidate and the reference with the best score [40], see (4)-(8).
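Following [40], the quantities referred to as (4)-(8) can be sketched as (with $m$ the set of matched unigrams in the best alignment):

\[ Pen = \gamma \left(\frac{ch}{m}\right)^{\theta} \tag{4} \]

\[ F_{mean} = \frac{P_m R_m}{\alpha P_m + (1-\alpha) R_m} \tag{5} \]

\[ P_m = \frac{|m|}{\sum_k h_k(c_i)} \tag{6} \]

\[ R_m = \frac{|m|}{\sum_k h_k(s_{ij})} \tag{7} \]

\[ METEOR = (1 - Pen)\, F_{mean} \tag{8} \]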
ROUGE-L evaluates automatically by comparing generated texts with human reference texts based on the longest common subsequence (LCS) between them; if high similarity is shown, the quality of the generating system is considered good [45]. Considering $l(c_i, s_{ij})$ to be the length of the LCS between the two sentences, $ROUGE_L$ is obtained by calculating the F-measure [40]. Equations (9)-(11) are used to calculate the metric, where $R_l$ is the recall and $P_l$ is the precision of the LCS, while $\beta$ is typically set to favor recall ($\beta = 1.2$).

Consensus-based image description evaluation (CIDEr) measures the consensus between candidate and reference texts using n-gram matching. Term frequency-inverse document frequency (TF-IDF) weighting is calculated for n-grams that are common in all texts [38]. The frequency of occurrence of an n-gram $\omega_k$ in the reference sentence $s_{ij}$ or the candidate sentence $c_i$ is denoted by $h_k(s_{ij})$ or $h_k(c_i)$. CIDEr calculates the TF-IDF weighting $g_k(s_{ij})$ for each n-gram $\omega_k$ using (12) [40], where $I$ represents the set of all images in the dataset and $\Omega$ represents the vocabulary of all n-grams. The first term calculates the TF of each n-gram $\omega_k$, while the second term calculates the rarity of $\omega_k$ by using its IDF. To calculate the $CIDEr_n$ score for n-grams of length $n$, the mean cosine similarity between candidate sentences and reference sentences, which accounts for both precision and recall, is computed as in (13). The vector $g^n(c_i)$ represents all n-grams of length $n$ and is formed by $g_k(c_i)$, while $\|g^n(c_i)\|$ represents its magnitude; likewise for $g^n(s_{ij})$. Longer n-grams capture grammatical properties and richer semantics, so the scores from n-grams of various lengths are combined as in (14), where uniform weights $w_n = 1/N$ are used, with $N = 4$.
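Again following the formulation in [40], the definitions referred to as (9)-(14) can be sketched as:

\[ R_l = \max_j \frac{l(c_i, s_{ij})}{|s_{ij}|} \tag{9} \]

\[ P_l = \max_j \frac{l(c_i, s_{ij})}{|c_i|} \tag{10} \]

\[ ROUGE_L(c_i, S_i) = \frac{(1+\beta^2) R_l P_l}{R_l + \beta^2 P_l} \tag{11} \]

\[ g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})} \log\!\left(\frac{|I|}{\sum_{I_p \in I} \min\big(1, \sum_q h_k(s_{pq})\big)}\right) \tag{12} \]

\[ CIDEr_n(c_i, S_i) = \frac{1}{m} \sum_j \frac{g^n(c_i) \cdot g^n(s_{ij})}{\|g^n(c_i)\|\, \|g^n(s_{ij})\|} \tag{13} \]

\[ CIDEr(c_i, S_i) = \sum_{n=1}^{N} w_n\, CIDEr_n(c_i, S_i) \tag{14} \]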
Semantic propositional image caption evaluation (SPICE) estimates text quality by converting candidate and reference texts into semantic representations called "scene graphs" that encode the objects, attributes, and relationships found in the text [44]. The first subtask is parsing captions into scene graphs. A caption $c$ is parsed into a scene graph $G(c)$, given a set of attribute types $A$, a set of object classes $C$, and a set of relation types $R$, as in (15) [46]. Here, $O(c) \subseteq C$ is the set of objects mentioned in $c$, $E(c) \subseteq O(c) \times R \times O(c)$ is the set of hyper-edges that represent relationships between objects, and $K(c) \subseteq O(c) \times A$ is the set of attributes associated with the objects. The second step is to calculate the F-score. To compare how closely two scene graphs resemble one another, their semantic relationships are viewed as a conjunction of logical propositions, or tuples, so a function $T$ reads a scene graph and returns its logical tuples as in (16), that is, $T(G(c)) \triangleq O(c) \cup E(c) \cup K(c)$. Each tuple contains one, two, or three elements, representing objects, attributes, and relations, respectively. The binary matching operator $\otimes$ is defined as the function that returns matching tuples in two scene graphs by viewing the semantic propositions in each scene graph as a set of tuples. Finally, the F-score, recall $R$, and precision $P$ are defined as in (17)-(19).

Of all the metrics used, namely BLEU-n, METEOR, ROUGE-L, CIDEr, and SPICE, we selected four evaluation metrics that are used in almost all of the literature: BLEU-4, ROUGE-L, CIDEr, and METEOR. Table 4 shows the average value obtained for each evaluation metric by the transformer and attention architectures used in the literature discussed. However, the calculations in Table 4 only include works that report c5 and c40 scores. It can be seen from the table that the transformer models obtain a higher score on every evaluation metric than the vanilla attention mechanism models.

General concept
At this stage of the analysis, the commonly found steps for building an image captioning model are formulated. The general model formulation factors are taken from the dataset, preprocessing, architecture, and method sides, as well as from the evaluation used. Figure 1 shows a generic flow chart based on the general modeling found in the reviewed literature.
Many of the datasets used are open source. The datasets are available along with their splits for image captioning, such as Flickr30K with about 30 thousand images, Flickr8K with 8 thousand images, MS COCO with 180 thousand images, and Conceptual Captions with 3.3 million images. The distribution of each dataset, the number of images, and the literature using it can be seen in Table 3.
Preprocessing mostly follows the steps of Karpathy [39]. To perform the preprocessing stage on the MS COCO dataset, the available source code can be followed [47]. Karpathy's preprocessing includes tokenizing, lowercasing the text, and then changing words whose frequency is less than 5 to "UNK" (unknown) or deleting them. In addition, some works also set varying caption lengths for MS COCO and Flickr30k [12], [20]. The words are then represented using GloVe or one-hot encoding [2], [25].
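A minimal sketch of this preprocessing step is shown below (our own illustration; the function name and toy captions are assumptions, and only the frequency threshold of 5 follows the description above):

import re
from collections import Counter

def preprocess(captions, min_freq=5):
    """Karpathy-style preprocessing sketch: lowercase, tokenize, map rare words to "UNK"."""
    tokenized = [re.findall(r"[a-z0-9]+", c.lower()) for c in captions]
    freq = Counter(word for tokens in tokenized for word in tokens)
    vocab = {word for word, count in freq.items() if count >= min_freq}
    processed = [[w if w in vocab else "UNK" for w in tokens] for tokens in tokenized]
    return processed, vocab

captions = ["A man riding a horse.", "Two dogs play in the snow."]
processed, vocab = preprocess(captions, min_freq=1)   # the literature typically uses min_freq=5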
The architectures and methods studied show various transformer and attention models. Many models rely on the encoder-decoder framework because it is considered flexible [48]. The flexibility lies not only in designing the model architecture, but also in implementing such an architecture in different domains, for instance, molecular image captioning [49]. The role of the encoder is to extract features from the input image, while the decoder generates grammatically appropriate words. In the reviewed studies, the most widely used encoders are CNN variants, while the most widely used decoders are LSTM, CNN, and RNN [2], [5], [7], [8], [42]. The methods used are modifications of transformers and attention; their discussion can be seen in section 3.1. The methods studied were obtained from 36 works. These methods are evaluated using the original dataset splits or those from Karpathy [8].
The most widely used evaluations are the BLEU, METEOR, ROUGE-L, CIDEr, and SPICE metrics, and their scores are widely used to assess image captioning results. To calculate the evaluation metrics, much of the literature we reviewed relies on the source code made available with MS COCO [50]. Across all the literature we reviewed, the metrics most often used are BLEU-4, ROUGE-L, CIDEr, and METEOR, which are popular and known to have a strong correlation with human judgment [51], [52].
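As an illustration, assuming the pycocoevalcap packaging of the MS COCO evaluation code [50] is installed, the metrics can be computed roughly as follows (the image ids and captions are toy examples; METEOR and SPICE additionally require a Java runtime):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# References and model outputs, keyed by an arbitrary image id
gts = {"img1": ["a dog runs on the beach", "a dog is running along the shore"]}
res = {"img1": ["a dog running on a beach"]}

bleu, _ = Bleu(4).compute_score(gts, res)    # list of BLEU-1..BLEU-4 scores
rouge, _ = Rouge().compute_score(gts, res)   # ROUGE-L score
cider, _ = Cider().compute_score(gts, res)   # CIDEr score
print(bleu, rouge, cider)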

CONCLUSION
From this study on attention-based deep learning architecture models for image captioning, 36 works were found. With all models being attention-based architectures, half of the works listed in this study use the transformer. Five types of evaluation metrics were used across the works: BLEU, CIDEr, METEOR, ROUGE-L, and SPICE. BLEU remains the most used evaluation metric for image captioning. From the analysis, it is known that, on average, the transformer models obtain higher evaluation metric scores on BLEU-4, CIDEr, ROUGE-L, and METEOR than the works with vanilla attention mechanism models. In the Indonesian language domain, as one example of a low-resource language, only a few works were found, and most of them still rely on the common MS COCO dataset as the base. However, there is an effort to create a novel dataset that captures local culture in the captions. Finally, this study provides a foundation for the future development of image captioning models and presents a general understanding of the process of developing image captions, including a picture of that process in low-resource languages.

ACKNOWLEDGEMENT
The study is sponsored by the SAME (Scheme for Academic Mobility and Exchange) 2022 program (Decree No. 3253/E4/DT.04.03/2022) from the Directorate of Resources, Directorate General of Higher Education, Research and Technology, Ministry of Education, Culture, Research and Technology, Republic of Indonesia.


Figure 1. General flow diagram of the image captioning model and concept

Table 2. Literature review result for Indonesian datasets

Table 4. Average score for each evaluation metric

Table 1. Literature review result