A comparative study of large language models with chain-of-thought prompting for automated program repair

Eko Darwiyanto, Rizky Akbar Gusnaen, Rio Nurtantyana

Abstract


Automatic code repair is an important task in software development for reducing bugs efficiently. This research develops and evaluates a chain-of-thought (CoT) prompting approach to improve the ability of large language models (LLMs) on automated program repair (APR) tasks. CoT prompting guides an LLM to generate a step-by-step explanation before producing its final answer, which is expected to improve the accuracy and quality of code repair. This research uses the QuixBugs dataset to evaluate the performance of several LLMs, including DeepSeek-V3 and GPT-4o, under two prompting methods: standard prompting and CoT prompting. The evaluation is based on the average number of plausible patches generated and on the estimated token-usage cost. The results show that CoT prompting improves performance over standard prompting for most models. DeepSeek-V3 recorded the highest performance, with an average of 36.6 plausible patches and the lowest cost of $0.006. GPT-4o also showed competitive results, with an average of 35.8 plausible patches at a cost of $0.226. These results confirm that CoT prompting is an effective technique for improving LLM reasoning ability in APR tasks.
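
The paper's exact prompt templates are not reproduced on this page. The sketch below is a minimal illustration, assuming an OpenAI-compatible chat API and a QuixBugs-style buggy binary search, of how a standard repair prompt and a CoT repair prompt might differ; the prompt wording, model name, and the helper request_patch are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' templates): standard vs. chain-of-thought
# prompting for an APR task, assuming an OpenAI-compatible chat API.
from openai import OpenAI

# QuixBugs-style buggy function: the loop condition lets `mid` run past the
# end of the array (should be `lo < hi`).
BUGGY_CODE = '''
def find_first_in_sorted(arr, x):
    lo, hi = 0, len(arr)
    while lo <= hi:  # bug: should be lo < hi
        mid = (lo + hi) // 2
        if x == arr[mid] and (mid == 0 or x != arr[mid - 1]):
            return mid
        elif x <= arr[mid]:
            hi = mid
        else:
            lo = mid + 1
    return -1
'''

# Standard prompting: ask directly for the fixed function.
STANDARD_PROMPT = (
    "The following Python function is buggy. Return only the fixed function.\n\n"
    + BUGGY_CODE
)

# CoT prompting: ask for step-by-step reasoning before the final fix.
COT_PROMPT = (
    "The following Python function is buggy.\n"
    "First, explain step by step what the function is supposed to do, "
    "then identify the faulty line and why it is wrong, "
    "and finally return the fixed function.\n\n"
    + BUGGY_CODE
)

def request_patch(prompt: str, model: str = "gpt-4o") -> str:
    """Send one repair prompt and return the model's reply (candidate patch)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Compare the two prompting styles on the same bug; a patch is "plausible"
    # when the repaired function passes the dataset's reference tests.
    print(request_patch(STANDARD_PROMPT))
    print(request_patch(COT_PROMPT))
```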

Keywords


Automated program repair; Chain-of-thought prompting; Large language models; QuixBugs; Standard prompting


DOI: http://doi.org/10.11591/ijai.v14.i6.pp4579-4589



Copyright (c) 2025 Eko Darwiyanto, Rizky Akbar Gusnaen, Rio Nurtantyana

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938 
This journal is published by the Institute of Advanced Engineering and Science (IAES).
