Arabic text diacritization using transformers: a comparative study
Abstract
The Arabic language presents challenges for natural language processing (NLP) tasks. One such challenge is diacritization, the task of adding diacritical marks to Arabic text to improve readability and resolve ambiguity. Diacritics play a crucial role in determining the correct pronunciation, meaning, and grammatical structure of words and sentences; for example, the undiacritized form علم can be read as عِلْم (ʿilm, "knowledge") or عَلَم (ʿalam, "flag"). However, Arabic texts are often written without diacritics, making NLP tasks more complex. This study investigates the efficacy of advanced machine learning models for automatic Arabic text diacritization, focusing on the Arabic bidirectional encoder representations from transformers (AraBERT) and bidirectional long short-term memory (Bi-LSTM) models. AraBERT, a BERT derivative, leverages the transformer architecture to capture contextual subtleties and discern linguistic patterns within a substantial corpus. Our comprehensive evaluation benchmarks the performance of these models and shows that AraBERT significantly outperforms the Bi-LSTM, achieving a diacritic error rate (DER) of 0.81% and an accuracy of 98.15%, against the Bi-LSTM's DER of 1.02% and accuracy of 93.88%. The study also explores optimization strategies to improve model performance, setting a precedent for future research to enhance Arabic diacritization and contribute to the advancement of Arabic NLP.
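For readers unfamiliar with the metric, DER is conventionally the fraction of Arabic letters whose predicted diacritics differ from the reference. The Python sketch below is illustrative only and is not code from the paper: the helper names (split_diacritics, diacritic_error_rate) are our own, it assumes the reference and prediction share the same base-letter sequence, and it ignores details such as case-ending exclusion that published DER figures often control for.

# Illustrative sketch of a conventional diacritic error rate (DER);
# not the evaluation code used in the paper.

# The eight standard Arabic diacritic code points (tanwin forms,
# fatha, damma, kasra, shadda, sukun).
ARABIC_DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def split_diacritics(text):
    """Return (base character, diacritic string) pairs for the text."""
    pairs = []
    for ch in text:
        if ch in ARABIC_DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)  # attach mark to preceding letter
        else:
            pairs.append((ch, ""))
    return pairs

def diacritic_error_rate(reference, prediction):
    """Fraction of Arabic letters whose predicted diacritics mismatch."""
    errors = 0
    total = 0
    for (r_base, r_marks), (_, h_marks) in zip(
        split_diacritics(reference), split_diacritics(prediction)
    ):
        if not ("\u0621" <= r_base <= "\u064A"):
            continue  # count only Arabic letters, not spaces or punctuation
        total += 1
        if r_marks != h_marks:
            errors += 1
    return errors / total if total else 0.0

# Example: كَتَبَ (kataba, "he wrote") vs. a prediction with one wrong
# diacritic on the middle letter: 1 mismatch / 3 letters -> DER = 1/3.
print(diacritic_error_rate("كَتَبَ", "كَتُبَ"))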
Keywords
AraBERT; Arabic text diacritization; Bidirectional long short-term memory; Natural language processing; Transformers
Full Text: PDF
DOI: http://doi.org/10.11591/ijai.v14.i1.pp702-711
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).