Evaluating document chunking approaches for retrieval augmented generation in editorial content
Abstract
Retrieval-augmented generation (RAG) systems promise grounded answers from large language models (LLMs), yet their performance depends critically on how source documents are segmented before indexing. This study investigates how pre-index chunking strategies affect both retrieval accuracy and answer quality in domain-specific scenarios. We curated a corpus of software-as-a-service (SaaS) editorial content and constructed a high-quality evaluation dataset containing 2,419 question-answer (QA) pairs generated through automated prompting and quality control. We compared four chunking approaches: fixed-size, structure-aware recursive, semantic, and LLM-based. Our evaluation protocol assessed retrieval through document localization, semantic similarity, and context relevance, while generation quality was evaluated using chain-of-thought (CoT) criteria scored by LLM judges. Results demonstrate that recursive chunking consistently outperforms the other approaches across all metrics. Smaller chunks improve document localization, while moderately larger chunks enhance semantic alignment and generation scores. LLM-based chunking variants show competitive performance but do not exceed the top recursive configurations on this dataset. These findings indicate that preserving document structure through recursive chunking is beneficial for practical RAG implementations, providing actionable guidance for chunk size selection while highlighting token-budget constraints in current long-context models.
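To make the contrast between fixed-size and structure-aware recursive chunking concrete, the following minimal Python sketch may help. It is not the implementation evaluated in this study: the function names, the separator order, and the use of character counts instead of token counts are all assumptions chosen for illustration.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut the text every `size` characters, ignoring document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Assumed separator order, from coarse (paragraphs) to fine (words).
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_chunks(text: str, size: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first and recurse only into
    pieces that are still longer than `size` characters."""
    if len(text) <= size:
        return [text] if text.strip() else []
    if not separators:
        # No structural boundary left: fall back to a hard cut.
        return fixed_size_chunks(text, size)
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they fit in one chunk.
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= size:
            buffer = candidate
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= size:
            buffer = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_chunks(piece, size, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks
```

On a short three-paragraph text with a 30-character limit, the fixed-size splitter cuts through the middle of the second paragraph, whereas the recursive splitter returns each paragraph as its own chunk. That boundary-preserving behaviour is the property the abstract refers to as preserving document structure.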
Keywords
Document chunking; Information retrieval; Large language model; Natural language processing; Question answering; Retrieval-augmented generation; Text segmentation
DOI: http://doi.org/10.11591/ijai.v15.i2.pp1909-1918
Copyright (c) 2026 Erwann Lavarec, Yu Du

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938
This journal is published by the Institute of Advanced Engineering and Science (IAES).