Transformer+transformer architecture for image captioning in the Indonesian language
Abstract
Image captioning in the Indonesian language poses a significant challenge due to the complex interplay between visual and linguistic comprehension, as well as the scarcity of publicly available datasets. Despite considerable advancements in the field, research specifically targeting Indonesian remains limited. In this paper, we propose a novel image captioning model employing a transformer-based architecture for both the encoder and decoder components. Our model is trained and evaluated on the Flickr30k dataset pre-translated into Indonesian. We conduct a comparative analysis of various transformer-transformer configurations and convolutional neural network (CNN)-recurrent neural network (RNN) architectures. Our findings highlight the superior performance of a vision transformer (ViT) as the visual encoder combined with IndoBERT as the textual decoder. This architecture achieves a BLEU-4 score of 0.223 and a ROUGE-L score of 0.472.
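For readers who want a concrete picture of the encoder and decoder pairing described above, the following is a minimal sketch in Python using the HuggingFace transformers library. The checkpoint names ("google/vit-base-patch16-224-in21k" for the ViT encoder, "indobenchmark/indobert-base-p1" for the IndoBERT decoder) and the input file "example.jpg" are illustrative assumptions, not the paper's exact configuration:

    # Minimal ViT encoder + IndoBERT decoder captioning sketch (assumed
    # checkpoints; the paper's exact setup and hyperparameters may differ).
    from PIL import Image
    from transformers import (
        VisionEncoderDecoderModel,
        ViTImageProcessor,
        AutoTokenizer,
    )

    # Pair a ViT visual encoder with an IndoBERT textual decoder; the
    # VisionEncoderDecoderModel wrapper adds cross-attention layers that let
    # the decoder attend to the encoder's patch features.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        "google/vit-base-patch16-224-in21k",
        "indobenchmark/indobert-base-p1",
    )

    image_processor = ViTImageProcessor.from_pretrained(
        "google/vit-base-patch16-224-in21k"
    )
    tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

    # Special tokens the decoder needs for autoregressive generation.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.eos_token_id = tokenizer.sep_token_id

    # Caption a single image (hypothetical file) with beam search.
    image = Image.open("example.jpg").convert("RGB")
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Note that the newly added cross-attention weights are randomly initialized, so a model assembled this way still needs fine-tuning on caption data (here, the Indonesian translations of Flickr30k) before it produces useful captions.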
Keywords
Image Captioning, Deep Learning, Computer Vision, Natural Language Processing, Transformer Architecture
Full Text: PDF
DOI: http://doi.org/10.11591/ijai.v14.i3.pp2338-2346
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938
This journal is published by the Institute of Advanced Engineering and Science (IAES).