Fine-tuning CodeLlama for Multi-File Code Generation in a Private Repository

Hello. I'm trying to fine-tune CodeLlama for a multi-file code generation task on my private repository.
The goal is to have the LLM generate fixes for some common bugs/issues that cut across multiple files in my private repository.

From what I have been able to understand so far, doing this will require multiple stages of training/fine-tuning. I read the CodeLlama paper and am trying to build my own "specialization pipeline" for my repository and tasks.

  1. The first fine-tuning stage gives the model some comprehension of the repository structure: file paths, a summary of what each file does, and the code itself. This stage uses 100% of the code in the repository, and the goal is to let the model overfit. Here we only track training loss; there are no evaluation or test sets.
  2. Once the model has some comprehension of the repository structure, a second, task-specific fine-tuning pass can be done on a much smaller dataset specific to the task. E.g. the dataset fields could be the issue, the old code, and the refactored code. We can then check training loss, evaluation loss, and the test results to measure the performance of the model. (A rough sketch of what both datasets could look like is below.)
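For concreteness, here is a minimal sketch of how the two datasets could be laid out as prompt/completion records. This is just how I am picturing it, not a format CodeLlama requires; the field names, file paths, and summary placeholder are all assumptions on my part:

```python
import json
from pathlib import Path

REPO_ROOT = Path("my_private_repo")  # hypothetical repository root

def build_stage1_records():
    """Stage 1: repository-comprehension records (path + summary + code)."""
    records = []
    for file in REPO_ROOT.rglob("*.py"):
        code = file.read_text(encoding="utf-8")
        records.append({
            # The prompt anchors the file in the repository structure;
            # the completion is the code the model should learn to reproduce.
            "prompt": f"File: {file.relative_to(REPO_ROOT)}\n"
                      f"Summary: <one-line summary of the file>\n",
            "completion": code,
        })
    return records

def build_stage2_records(issues):
    """Stage 2: task-specific records (issue + old code -> refactored code)."""
    return [
        {
            "prompt": f"Issue: {item['issue']}\nFile: {item['path']}\n"
                      f"Old code:\n{item['old_code']}\n",
            "completion": item["refactored_code"],
        }
        for item in issues
    ]

if __name__ == "__main__":
    with open("stage1.jsonl", "w", encoding="utf-8") as f:
        for record in build_stage1_records():
            f.write(json.dumps(record) + "\n")
```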

The reason I want to do it this way is that while the fixes (the fixed code) are common, the files in which the code has to change might differ. So the model needs to have some understanding of the files present in the repository.

Does this approach sound reasonable or feasible? Are there alternative ways of doing this? If so, could you point me to some resources that I can read and learn from?

Thanks.


I have the same idea for our code base. However, if you don't mind me asking, how do you get your data into the model's format? I cannot figure out how to take raw code and feed it into the model. It looks like the model has to have instruction prompts. What are your thoughts?

What I gather from someone is that your description becomes the prompt and the rest becomes the completion. Likewise, a smaller part of the code can be the prompt and the rest the completion. I haven't tried this myself.
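If I were to try it, I imagine something like this rough sketch (untested; the file path and split points are arbitrary examples, not a fixed recipe). As far as I understand, a base (non-instruct) model can also be trained on raw code with a plain causal-LM objective, without any instruction template at all:

```python
import random
from pathlib import Path

def code_to_pairs(source: str, n_pairs: int = 4):
    """Turn one raw source file into several prompt/completion pairs by
    cutting it at a random offset: the text before the cut is the prompt,
    the text after it is the completion the model learns to generate."""
    pairs = []
    for _ in range(n_pairs):
        # cut somewhere in the middle half of the file
        cut = random.randint(len(source) // 4, 3 * len(source) // 4)
        pairs.append({"prompt": source[:cut], "completion": source[cut:]})
    return pairs

source = Path("src/example_module.py").read_text()  # hypothetical file
dataset = code_to_pairs(source)
```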


I am trying to do the same thing. I understand you can fine-tune the model with raw data, and I am trying to use GitHub - georgesung/llm_qlora: Fine-tuning LLMs using QLoRA.
I succeeded in training it with question-answer data, and it then answers questions the way the training data does.
Then I tried to use 10k lines of code as raw data for 3 epochs, but when I then ask it to continue a function, it doesn't use the information from the raw data. I'm not sure why.
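For reference, this is roughly the kind of setup I mean, rewritten as a minimal transformers/peft sketch. It is not that repo's exact config; the model name, LoRA hyperparameters, and data file here are placeholders:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "codellama/CodeLlama-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit and wrap it with LoRA adapters (QLoRA).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

# Raw code dump, one chunk per line; trained with a plain causal-LM objective.
raw = load_dataset("text", data_files={"train": "repo_code.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qlora-out", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```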

Interesting, I am going to check this out. It could be a case of catastrophic forgetting.

@animeshj9 @Thinkcru Hi guys, sorry for asking, but I am running into a problem creating the dataset for a similar task. Did you manage to get any results? How did it go?