Integrating IndoBERT and balanced iterative reducing and clustering using hierarchies of BERTopic in Indonesian short text
Abstract
Short text topic modeling remains challenging because of data sparsity, limited word co-occurrence, and unstable clustering results, particularly for Indonesian texts. This study proposes an improved BERTopic framework that integrates IndoBERT embeddings, best match 25 (BM25)-based topic representation, and balanced iterative reducing and clustering using hierarchies (BIRCH) clustering to address these issues. IndoBERT generates contextual embeddings adapted to Indonesian linguistic features, and BM25 weighting improves keyword relevance by accounting for document length and term-frequency saturation. BIRCH clustering minimizes outliers by assigning most documents to valid clusters, which improves data utilization and topic stability. Experiments on Indonesian datasets from X (formerly Twitter), Google Reviews, and YouTube show that the proposed approach consistently achieves higher topic coherence. Across increasing dataset sizes, the proposed method yields stable topic diversity values between 0.91 and 0.94, embedding density between 0.60 and 0.66, and intra-topic similarity between 0.39 and 0.41. The framework also reduces the outlier proportion to 1-5%, a result that significantly outperforms standard BERTopic and K-Means. Furthermore, the model maintains stable topic counts as the data volume grows, confirming its robustness and scalability for sparse short text modeling. Overall, integrating IndoBERT, BM25, and BIRCH provides a more coherent, stable, and effective solution for Indonesian short text topic modeling.
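To make the BM25-based topic representation and the topic diversity metric concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the classic Okapi BM25 formula applied to one pseudo-document per topic (all texts of a cluster concatenated), which may differ from the exact variant used in the study. The function names, the parameters `k1` and `b`, and the toy Indonesian reviews are illustrative assumptions. Topic diversity is computed here as the share of unique words among the top keywords of all topics.

```python
import math
from collections import Counter


def bm25_topic_keywords(topic_docs, top_n=2, k1=1.5, b=0.75):
    """Rank words per topic with Okapi BM25 (hypothetical helper).

    topic_docs maps a topic id to the list of short texts assigned to it.
    Each topic is treated as a single pseudo-document.
    """
    # Concatenate the tokens of all documents that share a topic.
    pseudo_docs = {
        topic: [tok for doc in docs for tok in doc.lower().split()]
        for topic, docs in topic_docs.items()
    }
    n_topics = len(pseudo_docs)
    avg_len = sum(len(toks) for toks in pseudo_docs.values()) / n_topics

    # Document frequency: in how many topics does each term occur?
    df = Counter()
    for toks in pseudo_docs.values():
        df.update(set(toks))

    keywords = {}
    for topic, toks in pseudo_docs.items():
        tf = Counter(toks)
        # Length normalization: longer pseudo-documents are penalized.
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        scores = {}
        for term, freq in tf.items():
            idf = math.log(1 + (n_topics - df[term] + 0.5) / (df[term] + 0.5))
            # The tf part saturates: repeated terms gain less and less.
            scores[term] = idf * freq * (k1 + 1) / (freq + length_norm)
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[topic] = ranked[:top_n]
    return keywords


def topic_diversity(keywords):
    """Fraction of unique words among the top keywords of all topics."""
    all_words = [w for words in keywords.values() for w in words]
    return len(set(all_words)) / len(all_words)


# Toy data (made up for illustration, not taken from the study's datasets).
toy_topics = {
    0: ["makanan enak sekali", "makanan murah dan enak"],
    1: ["pelayanan lambat sekali", "pelayanan buruk dan lambat"],
}
keywords = bm25_topic_keywords(toy_topics)
print(keywords)
print(topic_diversity(keywords))
```

In this toy example, words shared by both topics (such as "sekali" and "dan") receive a low inverse document frequency, so topic-specific words rank first. A diversity value close to 1 indicates that topics share few top keywords.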
Keywords
BERTopic; BIRCH; BM25; IndoBERT; Indonesian short text
DOI: http://doi.org/10.11591/ijai.v14.i5.pp4192-4201
Copyright (c) 2025 Muhammad Muhajir, Gunardi, Danardono, Dedi Rosadi
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938
This journal is published by the Institute of Advanced Engineering and Science (IAES).