Fine-Tuning LLMs on Large Proprietary Codebases

I’m currently fine-tuning a large language model (LLM) on a proprietary codebase. The fine-tuning process itself has completed without technical issues, but the performance of the resulting model is very poor—its responses are largely irrelevant, even when asked questions that are directly taken from the training dataset.

The objective of this fine-tuning effort is to enable the model to assist with tasks related to the private codebase, such as code generation, code explanation, and guidance on using internal APIs. From my observations, the real challenge lies in how the dataset is being prepared.

Currently, my dataset generation process is relatively straightforward: I traverse the codebase and use a local deployment of the Qwen3-235B model to generate question-answer pairs for individual code files. For security reasons, only locally deployed models are used. Here's the prompt I'm using, followed by a rough sketch of the loop that drives it:

# Role and Tasks
You are a senior game engine code analysis expert who is good at understanding complex game engine code structure, core algorithms, and system architecture. Your task is to analyze the provided internal engine code and generate high-quality training data for large-model fine-tuning.

## Core Tasks
Based on the provided code, generate no fewer than {question_num} high-quality technical questions, along with corresponding comprehensive, accurate, and professional answers.

## Analysis focus
1. Code structure analysis
2. Core function implementation
3. Algorithms and data structures
4. API usage
5. Common problem solving

## Question type guide
- Code interpretation questions
- Function search questions
- Architecture design questions
- Optimization-related questions

## Output restrictions
Please output in JSON list format following the Alpaca training structure:
[
  {
    "instruction": "xxx",
    "input": "xxx",
    "output": "xxx"
  },
  ...
]

## Constraints (important!)
- Must be grounded in code content; avoid speculation
- Questions should be factually answerable from the code
- Cover both high-level and low-level aspects
- No hypothetical or vague questions
- Each instruction must have unique, factual value
- Reference specific classes/functions/files and include code snippets where needed
- Avoid generating questions about programming language syntax itself
- For code explanations, use implementation-heavy code snippets
- Instructions should sound like real developer questions
- Keep `input` empty if unnecessary
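
The loop that drives this prompt looks roughly like the sketch below (simplified for illustration: the OpenAI-compatible endpoint, model name, file extensions, and output path are placeholders rather than the exact script):

```python
import json
from pathlib import Path

from openai import OpenAI  # the local server exposes an OpenAI-compatible API

# Placeholder endpoint/model for the local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "qwen3-235b"
CODE_EXTS = {".h", ".hpp", ".cpp", ".cs", ".py", ".lua"}
PROMPT_TEMPLATE = "..."  # the prompt shown above, with {question_num} left to fill in

def generate_pairs(code: str, question_num: int = 5) -> list[dict]:
    prompt = PROMPT_TEMPLATE.replace("{question_num}", str(question_num))
    prompt += "\n\nCode:\n" + code
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    # Assumes the model returns a clean JSON list in Alpaca format;
    # real output may need code-fence stripping and validation first.
    return json.loads(resp.choices[0].message.content)

dataset = []
for path in Path("engine/").rglob("*"):
    if path.is_file() and path.suffix in CODE_EXTS:
        dataset.extend(generate_pairs(path.read_text(errors="ignore")))

Path("train_alpaca.json").write_text(json.dumps(dataset, ensure_ascii=False, indent=2))
```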

I only used Qwen3-235B for data generation, while the fine-tuning itself was done on a very small Qwen2.5 model. This is still just experimentation on very small models; I don't plan to move to larger models until the results look good.

The codebase is large and spans multiple programming languages. So far, I've only been feeding individual files into the model. I've considered including related files as context, but that presents two main challenges (a naive sketch of what I mean follows the list):

  1. Code parsing becomes extremely complex
  2. The model’s context window gets exceeded quickly
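
Even a naive version of this (resolving neighbors only via C-style #include, with a made-up token estimate and budget) runs into both problems almost immediately:

```python
import re
from pathlib import Path

# Illustrative only: neighbor resolution via C-style #include (which already
# fails for the C#/Python/Lua parts), plus a crude token budget.
MAX_TOKENS = 24_000                     # rough budget under the context window
INCLUDE_RE = re.compile(r'#include\s+"([^"]+)"')

def approx_tokens(text: str) -> int:
    return len(text) // 3               # crude estimate; a real tokenizer differs

def bundle_with_neighbors(path: Path, root: Path) -> str:
    parts = [path.read_text(errors="ignore")]
    for inc in INCLUDE_RE.findall(parts[0]):
        neighbor = root / inc
        if not neighbor.exists():
            continue                    # include paths rarely resolve this simply
        parts.append(neighbor.read_text(errors="ignore"))
        if sum(approx_tokens(p) for p in parts) > MAX_TOKENS:
            break                       # budget exceeded after a handful of headers
    return "\n\n".join(parts)
```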

I also experimented with generating training data from Git commit diffs and messages, slightly modifying the prompt to suit that format. However, like the file-based method, the resulting datasets don’t appear to be of high quality.
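
For reference, pulling the (message, diff) pairs out of Git can be done roughly like this (a sketch using plain git via subprocess; the repo path and commit limit are illustrative, and merge commits or huge diffs would still need filtering):

```python
import subprocess

def iter_commit_samples(repo="engine/", max_commits=500):
    """Yield (commit message, patch) pairs to feed into the QA-generation prompt."""
    shas = subprocess.run(
        ["git", "-C", repo, "log", f"-{max_commits}", "--pretty=%H"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for sha in shas:
        message = subprocess.run(
            ["git", "-C", repo, "show", "-s", "--pretty=%B", sha],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        diff = subprocess.run(
            ["git", "-C", repo, "show", "--pretty=format:", sha],
            capture_output=True, text=True, errors="replace", check=True,
        ).stdout
        yield {"message": message, "diff": diff}
```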

I’ve tried pre-training on the private codebase before fine-tuning, but this didn’t yield noticeable improvements either.

From what I’ve seen in the community, most discussions focus on dataset creation from documents. There’s far less guidance on dataset generation from codebases—especially large, complex, multi-language ones. Most existing practices seem geared toward code pre-training or FIM (Fill In the Middle) tasks.
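
For what it's worth, those pre-training/FIM pipelines sidestep QA generation entirely: a fill-in-the-middle sample is just a re-ordered slice of raw code (the sentinel tokens below follow the Qwen2.5-Coder convention and differ between model families):

```python
# A single FIM pre-training sample: split raw code and re-order it with
# sentinel tokens so the model learns to predict the missing middle.
def make_fim_sample(code: str, start: int, end: int) -> str:
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"
```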

I’m not sure whether my current methodology is flawed or if I need to significantly revise my approach. Any guidance would be greatly appreciated:

  • Is my current approach to dataset construction fundamentally sound?
  • Are there improvements I should make to the prompt or generation method?
  • Why might the fine-tuned model be performing so poorly despite the seemingly relevant dataset?

I’m continuing to experiment and will update here with any new progress. Thanks in advance for any help.


Let's set aside dataset creation for now. It's possible that the Qwen 3 weights or their Transformers implementation are not yet fully mature.

A quick way to test the suitability of the dataset would be to train a very small, reasonably well-established model on it and see whether that produces sensible results.

Thank you for your response. I apologize for not explaining this clearly earlier. Currently, I only use Qwen3-235B for data generation, while the fine-tuning is based on the very small Qwen2.5 model. It's still just experimentation on very small models, and I won't apply it to larger models until there are good results.


I think that’s a good approach. In that case, it might be an issue with the dataset or the Trainer settings themselves…:thinking:

It might be a good idea to ask on Hugging Face’s Discord or unsloth’s Discord.
Even if it’s not documented in articles or documentation, it’s common for people who know the ins and outs to be willing to share their expertise.

If the fine-tuning know-how is publicly available online, we can quickly summarize it using ChatGPT or Gemini, but for things that aren’t publicly available, they’re not very helpful…


by HuggingChat:

Summary of Key Recommendations

Your model’s poor performance stems from fragmented context and low-quality QA pairs in training data. Here’s how to fix it:


1. Core Problems

  • Isolated Code Files: Training on single files prevents systemic understanding (e.g., API interactions, architecture).
  • Shallow Questions: Generated QA pairs focus on syntax rather than real tasks (debugging, optimization).
  • Self-Supervised Flaws: Using a model to generate its own training data compounds errors.
  • Empty Input Fields: Model isn’t trained to use external context (e.g., code diffs, error logs).

2. Critical Fixes

  • Context-Aware QA:
    • Include related code snippets (e.g., parent/child classes, API consumers) in training data.
    • Use static analysis tools (e.g., Clang, language servers) to map dependencies.
  • Task-Driven Questions:
    • Prioritize high-value tasks:
      • Code generation (e.g., “Implement X system”).
      • Debugging (e.g., “Why does Y crash?”).
      • Optimization (e.g., “Reduce memory usage in Z”).
    • Force answers to include implementation-heavy code snippets.
  • RAG-Augmented Training (a minimal sketch follows this list):
    • Build a codebase vector database (function-level chunks, API docs, Git history).
    • Provide retrieved context during training (e.g., show relevant code when answering questions).
  • Data Quality Control:
    • Validate code snippets with static analysis/compiler checks.
    • Add human feedback to filter low-quality QA pairs.
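
A minimal sketch of the RAG-augmented idea above (the sentence-transformers embedding model, FAISS, and function-level chunking are assumptions about one possible stack, not a prescription):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any local code-aware embedder would do.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    """chunks: function/class-level snippets extracted from the codebase."""
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])   # cosine similarity via inner product
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def retrieve(index, chunks: list[str], question: str, k: int = 4) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]
```

The retrieved snippets can then be placed into the Alpaca `input` field so the model learns to answer with context rather than from memorization alone.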

3. Training Adjustments

  • Curriculum Learning:
    • Phase 1: Syntax-level tasks (API usage, code interpretation).
    • Phase 2: Architecture/design questions.
    • Phase 3: Optimization/debugging scenarios.
  • Loss Masking (a minimal sketch follows this list):
    • Exclude comments/padding from loss calculation.
  • Evaluation Metrics:
    • Measure execution accuracy (does the code compile?), context recall (does it reference relevant code?), and design pattern recognition.
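
A sketch of the prompt/padding side of the loss-masking point above (assumes an empty Alpaca `input` field, a simple instruction/response template, and a small Qwen2.5 base; masking out code comments would need extra token-level bookkeeping):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")  # assumed base

def tokenize_masked(example, max_len=2048):
    """Mask the prompt so loss is only computed on the answer tokens."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    full = prompt + example["output"] + tokenizer.eos_token
    ids = tokenizer(full, truncation=True, max_length=max_len)["input_ids"]
    prompt_len = min(len(tokenizer(prompt)["input_ids"]), len(ids))
    labels = list(ids)
    labels[:prompt_len] = [-100] * prompt_len      # -100 is ignored by the loss
    return {"input_ids": ids, "labels": labels}

# Padding added later should also be labelled -100, e.g. via
# DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100).
```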

4. Implementation Roadmap

  1. Weeks 1-2: Map code dependencies and generate cross-file QA pairs.
  2. Weeks 3-4: Redesign prompts for task-driven QA and validate data.
  3. Weeks 5-6: Implement RAG-augmented training with retrieval context.
  4. Weeks 7-8: Add human evaluation and custom metrics.

Final Takeaway:
Your current approach lacks systemic understanding and real-task alignment. Shift from isolated file analysis to context-aware, task-specific training (e.g., debugging, optimization) using cross-file code dependencies and RAG. This will enable the model to reason about the entire codebase, not just individual files.


I'll try asking on Discord. I've already tried ChatGPT, Gemini, Grok, and so on, and also used Deep Research, but haven't found a better solution yet. :face_holding_back_tears:
Anyway, thanks for your answer!
