DeepSeek-Coder 6.7B base fill-in-the-middle fine-tuning for automated program repair

The fine-tuned model is available on Hugging Face:
https://huggingface.co/ardalaaan/deepseek-coder-6.7b-base-APR-FIM-finetuning
Emergent code language models have shown superior performance on code-related tasks such as code and documentation generation, fault localization, automated program repair (APR), and many others [1].
In this project, the DeepSeek-Coder-6.7B-base language model [2] has been fine-tuned on the Megadiff dataset [3] with an infilling objective, specifically to improve the model’s performance on the automated program repair task.
This work is meant to follow and improve upon the RepairLlama paper [4]. RepairLlama uses a parameter-efficient fine-tuning technique called LoRA [5] to fine-tune CodeLlama-7B-base [6] on a dataset of pairs of buggy functions and their corresponding correct patches.
RepairLlama reasons that, in practice, the faulty region of the code can be detected by existing fault localization methods, so these so-called fault localization signals can be provided to the model by commenting out the buggy lines while keeping them in the input function fed to the model. By subsequently asking the model to fill in the blank left by the commented-out lines, the model learns to take advantage of the provided comments to generate a candidate patch more effectively. An illustrative example of this input representation is shown below.
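The following is a hypothetical illustration of that representation (the Java function and its bug are made up for illustration and are not taken from the RepairLlama dataset):

```python
# Hypothetical illustration of a fault-localization-signal input: the buggy
# lines stay in the function as comments, followed by a <fill_me> placeholder
# that the model is asked to replace with a fixed version of those lines.
buggy_input = """\
public static int maxOfTwo(int a, int b) {
    // if (a < b) {
    //     return a;
    // }
    <fill_me>
    return b;
}
"""

# The expected model output (the candidate patch) would then be something like:
candidate_patch = """\
    if (a > b) {
        return a;
    }
"""
```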
RepairLlama explains that CodeLlama was selected because it meets a few criteria: it is a language model that has been extensively pre-trained on code corpora, and its pre-training includes an infilling objective, which is considered a valuable ability for the task of program repair.
We use DeepSeek-Coder 6.7B, a more recent and more performant model that meets the same criteria and has been shown to have considerably superior code infilling ability [7].
A method has been proposed for teaching an autoregressive model (one that only considers left context when generating a new token) to perform infilling (considering both left and right context), commonly known as the fill-in-the-middle (FIM) transformation [8]. This method has been adopted by both CodeLlama and DeepSeek-Coder, along with many other language models, as the de facto approach to teaching a model to perform infilling.
While RepairLlama merely comments out the buggy section of each input function and follows the comments with a `<fill_me>` token, we further align the APR fine-tuning with the model’s original infilling training by incorporating the FIM transformation into the preprocessing of dataset entries during both fine-tuning and inference.
The original FIM transformation randomly divides each model input into three sections, namely prefix, middle, and suffix, and then moves the middle part to the end, so as to train the model to generate the middle part after having seen the prefix and suffix. In this work, however, the desired middle part (i.e. the buggy section) cannot be chosen randomly, as it is assumed to be determined by fault localization tools. Hence the original algorithm was modified to reflect this difference, roughly as sketched below.
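The sketch below shows the idea, assuming character offsets supplied by fault localization; the function name and signature are illustrative and may differ from the actual code in fim.py:

```python
def split_at_buggy_region(source: str, bug_start: int, bug_end: int) -> tuple[str, str, str]:
    """Split a buggy function into (prefix, middle, suffix).

    Unlike the original FIM transformation, the middle part is not sampled
    randomly: it is exactly the buggy span reported by fault localization,
    given here as character offsets bug_start and bug_end.
    """
    prefix = source[:bug_start]
    middle = source[bug_start:bug_end]  # the buggy section the model must rewrite
    suffix = source[bug_end:]
    return prefix, middle, suffix
```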
The FIM transformation introduces two hyperparameters, namely the FIM rate and the PSM rate. The former determines the proportion of input functions affected by the FIM transformation, and the latter decides what fraction of the transformed inputs are arranged in “prefix, suffix, middle” (PSM) order as opposed to the alternative “suffix, prefix, middle” (SPM) order. The FIM rate and PSM rate were set to 0.9 and 0.5 respectively, as recommended by Bavarian et al. [8].
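A minimal sketch of how these two rates could be applied per training sample is shown below; the sentinel strings and the helper name are placeholders (in practice the model’s own FIM special tokens are used), not the exact code in fim_transform_finetune.py:

```python
import random

# Placeholder sentinels in the style of Bavarian et al.; the real pipeline
# would use DeepSeek-Coder's own FIM special tokens instead.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def apply_fim(prefix: str, middle: str, suffix: str,
              fim_rate: float = 0.9, psm_rate: float = 0.5) -> str:
    """Build one training sample, transforming only a fim_rate fraction of inputs."""
    if random.random() >= fim_rate:
        # Leave (1 - fim_rate) of the samples in plain left-to-right order.
        return prefix + middle + suffix
    if random.random() < psm_rate:
        # PSM order: prefix, suffix, then the middle the model learns to generate.
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    # SPM order: suffix presented before prefix, middle still generated last.
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"
```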
Fine-tuning was performed with the Unsloth library, which enabled the computation to run on a single NVIDIA H100 SXM GPU with optimized memory and time consumption and automated hyperparameter adjustment.
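For reference, a minimal sketch of how such a setup can look with Unsloth is shown below, assuming a RepairLlama-style LoRA adapter; the sequence length, quantization, and LoRA hyperparameters are placeholders, not the values used for the published model:

```python
from unsloth import FastLanguageModel

# A minimal sketch; all hyperparameter values below are placeholders.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="deepseek-ai/deepseek-coder-6.7b-base",
    max_seq_length=2048,   # placeholder
    load_in_4bit=True,     # quantized loading to fit a single GPU
)

# Assumption: a LoRA adapter setup similar to RepairLlama's.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # placeholder LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# The model can then be trained with the usual Hugging Face / TRL trainer APIs.
```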
HumanEval-Java was chosen as the benchmark, since it suffers from less data leakage than its counterparts due to being released more recently.
Benchmark results for the base and fine-tuned models do not show any meaningful difference between them. Another fine-tuned model was also trained, differing only in that the FIM transformation was not applied (similar to RepairLlama), as a reference point for comparing the performance of the fine-tuned models on APR benchmarks. Unfortunately, this model was barely able to generate any compilable patches for the benchmark’s buggy functions. This is most probably due to an error on my part rather than a deficiency in RepairLlama’s proposed fine-tuning method.
The code in fim_transform_finetune.py and fim.py was largely adapted from the LLM-Workshop GitHub repository [9] and subsequently modified as described above.
[1] Zhang, Q. et al. 2023. A Survey on Large Language Models for Software Engineering. arXiv (Cornell University). (Jan. 2023). DOI:https://doi.org/10.48550/arxiv.2312.15223.
[2] Guo, D. et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence. arXiv (Cornell University). (Jan. 2024). DOI:https://doi.org/10.48550/arxiv.2401.14196.
[3] Monperrus, M. et al. 2021. Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size. arXiv (Cornell University). (Jan. 2021). DOI:https://doi.org/10.48550/arxiv.2108.04631.
[4] Silva, A. et al. 2023. RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair. arXiv (Cornell University). (Jan. 2023). DOI:https://doi.org/10.48550/arxiv.2312.15698.
[5] Hu, E.J. et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv (Cornell University). (Jan. 2021). DOI:https://doi.org/10.48550/arxiv.2106.09685.
[6] Rozière, B. et al. 2023. Code Llama: Open Foundation Models for Code. arXiv (Cornell University). (Jan. 2023). DOI:https://doi.org/10.48550/arxiv.2308.12950.
[7] Gong, L. et al. 2024. Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks. arXiv (Cornell University). (Mar. 2024). DOI:https://doi.org/10.48550/arxiv.2403.04814.
[8] Bavarian, M. et al. 2022. Efficient Training of Language Models to Fill in the Middle. arXiv (Cornell University). (Jan. 2022). DOI:https://doi.org/10.48550/arxiv.2207.14255.
[9] https://github.com/pacman100/LLM-Workshop
