TunDC: a public benchmark dataset for sentiment analysis and language modeling in the Tunisian dialect
Abstract
The development of natural language processing (NLP) applications has increasingly focused on dialectal variations of languages. The Tunisian dialect (TD), a widely spoken variant of Arabic, poses unique linguistic challenges due to its lack of standardized writing conventions and influences from multiple languages, including French, Italian, Turkish, and Berber. In this work, we introduce TunDC, a dataset of 20,044 labeled comments designed to advance NLP research on the TD. The dataset covers diverse linguistic forms (Arabic, Latin, and mixed scripts), and each comment was manually annotated for positive or negative sentiment by native speakers, achieving high inter-annotator agreement. To evaluate its effectiveness, we fine-tuned various models on TunDC. The bert-base-arabic-TunDC-mixed model achieved an accuracy of 0.84 and a macro-averaged F1-score of 0.83, demonstrating strong generalization across sentiment categories and writing systems. A stratified data-splitting strategy considering both sentiment and script type further improved accuracy by approximately 8% compared to standard splits. As a publicly available resource, TunDC contributes to the computational linguistics community, fostering advancements in language modeling and applications tailored to the TD.
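The abstract does not say how the stratified split over sentiment and script type was implemented. As a rough illustration only, one way to stratify on both attributes is to group comments by the (sentiment, script) pair and sample the same fraction from each group. The sketch below assumes this; the field names `sentiment` and `script`, the function `stratified_split`, the 20% test fraction, and the toy records are all hypothetical and are not taken from TunDC or the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split records so every (sentiment, script) group contributes
    roughly the same fraction of its members to the test set.

    The dictionary keys "sentiment" and "script" are assumed names,
    not the actual TunDC column names.
    """
    # Group records by the joint stratum (sentiment, script).
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["sentiment"], rec["script"])].append(rec)

    rng = random.Random(seed)
    train, test = [], []
    # Iterate strata in sorted order so the split is reproducible.
    for key in sorted(groups):
        members = list(groups[key])
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy records invented for illustration; not drawn from TunDC.
toy = [
    {"text": f"comment {i}", "sentiment": sentiment, "script": script}
    for sentiment in ("positive", "negative")
    for script in ("arabic", "latin", "mixed")
    for i in range(10)
]

train, test = stratified_split(toy, test_fraction=0.2, seed=42)
test_counts = Counter((r["sentiment"], r["script"]) for r in test)
```

A plain sentiment-only split can leave a script type under-represented in the test set; stratifying on the pair keeps each combination present in both partitions in similar proportions.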
Keywords
Arabic dataset; Artificial intelligence; Fine-tuning; Large language model; Low-resource language; Sentiment analysis; Tunisian dialect
DOI: http://doi.org/10.11591/ijai.v15.i2.pp1891-1908
Copyright (c) 2026 Ahmed Khalil Boulahia, Mourad Mars

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938
This journal is published by the Institute of Advanced Engineering and Science (IAES).