Fine-tuning CodeLlama for Multi-File Code Generation in a Private Repository

Hello. I'm trying to fine-tune CodeLlama for a multi-file code generation task on my private repository.
The goal is to have the LLM generate fixes for some common bugs/issues that cut across multiple files in my private repository.

From what I have been able to understand so far, doing this will require multiple stages of training/fine-tuning. I read the CodeLlama paper and am trying to build my own "specialization pipeline" for my repository and tasks.

  1. The first fine-tuning stage gives the model some comprehension of the repository structure: file paths, a summary of what each file does, and the code itself. This stage uses 100% of the code in the repository, and the goal is to let the model overfit. Here we only track training loss; there are no evaluation or test sets.
  2. Once the model has some comprehension of the repository structure, a second, task-specific fine-tuning pass can be done on a much smaller dataset specific to the task. E.g. the dataset fields could be the issue, the old code, and the refactored code. We can then check training loss, evaluation loss, and the test results to measure the performance of the model. (A rough sketch of what both datasets could look like is below.)
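For concreteness, here is a minimal sketch of how the two datasets could be laid out as prompt/completion records. This is just how I am picturing it, not a format CodeLlama requires; the field names, file paths, and summary placeholder are all assumptions on my part:

```python
import json
from pathlib import Path

REPO_ROOT = Path("my_private_repo")  # hypothetical repository root

def build_stage1_records():
    """Stage 1: repository-comprehension records (path + summary + code)."""
    records = []
    for file in REPO_ROOT.rglob("*.py"):
        code = file.read_text(encoding="utf-8")
        records.append({
            # The prompt anchors the file in the repository structure;
            # the completion is the code the model should learn to reproduce.
            "prompt": f"File: {file.relative_to(REPO_ROOT)}\n"
                      f"Summary: <one-line summary of the file>\n",
            "completion": code,
        })
    return records

def build_stage2_records(issues):
    """Stage 2: task-specific records (issue + old code -> refactored code)."""
    return [
        {
            "prompt": f"Issue: {item['issue']}\nFile: {item['path']}\n"
                      f"Old code:\n{item['old_code']}\n",
            "completion": item["refactored_code"],
        }
        for item in issues
    ]

if __name__ == "__main__":
    with open("stage1.jsonl", "w", encoding="utf-8") as f:
        for record in build_stage1_records():
            f.write(json.dumps(record) + "\n")
```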

The reason I want to do it this way is that while the fixes (the fixed code) are common, the files in which the code has to change might differ. So the model needs to have some understanding of the files present in the repository.

Does this approach sound reasonable or feasible? Are there alternative ways of doing this? If so, could you point me to some resources that I can read and learn from?

Thanks.


I have the same idea for our code base. However, if you don't mind me asking, how do you get your data into the model's format? I cannot figure out how to take raw code and feed it into the model. It looks like the model has to have instruction prompts. What are your thoughts?

What I gather from someone is that your description becomes the prompt and the rest becomes the completion. Likewise, a smaller part of the code can be the prompt and the rest the completion. I haven't tried this myself.
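If I were to try it, I imagine something like this rough sketch (untested; the file path and split points are arbitrary examples, not a fixed recipe). As far as I understand, a base (non-instruct) model can also be trained on raw code with a plain causal-LM objective, without any instruction template at all:

```python
import random
from pathlib import Path

def code_to_pairs(source: str, n_pairs: int = 4):
    """Turn one raw source file into several prompt/completion pairs by
    cutting it at a random offset: the text before the cut is the prompt,
    the text after it is the completion the model learns to generate."""
    pairs = []
    for _ in range(n_pairs):
        # cut somewhere in the middle half of the file
        cut = random.randint(len(source) // 4, 3 * len(source) // 4)
        pairs.append({"prompt": source[:cut], "completion": source[cut:]})
    return pairs

source = Path("src/example_module.py").read_text()  # hypothetical file
dataset = code_to_pairs(source)
```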


I am trying to do the same thing. I understand you can fine-tune the model with raw data, and I am trying to use GitHub - georgesung/llm_qlora: Fine-tuning LLMs using QLoRA.
I succeeded in training it with question-answer data, and it then answers questions the way the training data does.
Then I tried to use 10k lines of code as raw data for 3 epochs, but when I then ask it to continue a function, it doesn't use the information from the raw data. I'm not sure why.
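For reference, this is roughly the kind of setup I mean, rewritten as a minimal transformers/peft sketch. It is not that repo's exact config; the model name, LoRA hyperparameters, and data file here are placeholders:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "codellama/CodeLlama-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit and wrap it with LoRA adapters (QLoRA).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

# Raw code dump, one chunk per line; trained with a plain causal-LM objective.
raw = load_dataset("text", data_files={"train": "repo_code.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qlora-out", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```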

Interesting, I am going to check this out. It could be a case of catastrophic forgetting.

@animeshj9 @Thinkcru Hi guys, sorry for asking, but I am running into a problem creating the dataset for a similar task. Did you manage to get any results? How did it go?