Fine-Tuning LLMs on Large Proprietary Codebases

I’m currently fine-tuning a large language model (LLM) on a proprietary codebase. The fine-tuning process itself has completed without technical issues, but the performance of the resulting model is very poor—its responses are largely irrelevant, even when asked questions that are directly taken from the training dataset.

The objective of this fine-tuning effort is to enable the model to assist with tasks related to the private codebase, such as code generation, code explanation, and guidance on using internal APIs. From my observations, the real challenge lies in how the dataset is being prepared.

Currently, my dataset generation process is relatively straightforward: I traverse the codebase and use a local deployment of the Qwen3-235B model to generate question-answer pairs for individual code files. For security reasons, only locally deployed models are used. Here's the prompt I'm using, followed by a rough sketch of the loop that drives it:

# Role and Tasks
You are a senior game engine code analysis expert who is good at understanding complex game engine code structure, core algorithms, and system architecture. Your task is to analyze the provided internal engine code and generate high-quality training data for large-model fine-tuning.

## Core Tasks
Based on the provided code, generate no fewer than {question_num} high-quality technical questions, along with corresponding comprehensive, accurate, and professional answers.

## Analysis focus
1. Code structure analysis
2. Core function implementation
3. Algorithms and data structures
4. API usage
5. Common problem solving

## Question type guide
- Code interpretation questions
- Function search questions
- Architecture design questions
- Optimization-related questions

## Output restrictions
Please output in JSON list format following the Alpaca training structure:
[
  {
    "instruction": "xxx",
    "input": "xxx",
    "output": "xxx"
  },
  ...
]

## Constraints (important!)
- Must be grounded in code content; avoid speculation
- Questions should be factually answerable from the code
- Cover both high-level and low-level aspects
- No hypothetical or vague questions
- Each instruction must have unique, factual value
- Reference specific classes/functions/files and include code snippets where needed
- Avoid generating questions about programming language syntax itself
- For code explanations, use implementation-heavy code snippets
- Instructions should sound like real developer questions
- Keep `input` empty if unnecessary
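
The loop that drives this prompt looks roughly like the sketch below (simplified for illustration: the OpenAI-compatible endpoint, model name, file extensions, and output path are placeholders rather than the exact script):

```python
import json
from pathlib import Path

from openai import OpenAI  # the local server exposes an OpenAI-compatible API

# Placeholder endpoint/model for the local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "qwen3-235b"
CODE_EXTS = {".h", ".hpp", ".cpp", ".cs", ".py", ".lua"}
PROMPT_TEMPLATE = "..."  # the prompt shown above, with {question_num} left to fill in

def generate_pairs(code: str, question_num: int = 5) -> list[dict]:
    prompt = PROMPT_TEMPLATE.replace("{question_num}", str(question_num))
    prompt += "\n\nCode:\n" + code
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    # Assumes the model returns a clean JSON list in Alpaca format;
    # real output may need code-fence stripping and validation first.
    return json.loads(resp.choices[0].message.content)

dataset = []
for path in Path("engine/").rglob("*"):
    if path.is_file() and path.suffix in CODE_EXTS:
        dataset.extend(generate_pairs(path.read_text(errors="ignore")))

Path("train_alpaca.json").write_text(json.dumps(dataset, ensure_ascii=False, indent=2))
```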

I only used Qwen3-235B for data generation, while the fine-tuning itself was done on a very small Qwen2.5 model. This is still just experimentation on very small models; I don't plan to move to larger models until the results look good.

The codebase is large and spans multiple programming languages. So far, I've only been feeding individual files into the model. I've considered including related files as context, but that presents two main challenges (a naive sketch of what I mean follows the list):

  1. Code parsing becomes extremely complex
  2. The model’s context window gets exceeded quickly
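
Even a naive version of this (resolving neighbors only via C-style #include, with a made-up token estimate and budget) runs into both problems almost immediately:

```python
import re
from pathlib import Path

# Illustrative only: neighbor resolution via C-style #include (which already
# fails for the C#/Python/Lua parts), plus a crude token budget.
MAX_TOKENS = 24_000                     # rough budget under the context window
INCLUDE_RE = re.compile(r'#include\s+"([^"]+)"')

def approx_tokens(text: str) -> int:
    return len(text) // 3               # crude estimate; a real tokenizer differs

def bundle_with_neighbors(path: Path, root: Path) -> str:
    parts = [path.read_text(errors="ignore")]
    for inc in INCLUDE_RE.findall(parts[0]):
        neighbor = root / inc
        if not neighbor.exists():
            continue                    # include paths rarely resolve this simply
        parts.append(neighbor.read_text(errors="ignore"))
        if sum(approx_tokens(p) for p in parts) > MAX_TOKENS:
            break                       # budget exceeded after a handful of headers
    return "\n\n".join(parts)
```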

I also experimented with generating training data from Git commit diffs and messages, slightly modifying the prompt to suit that format. However, like the file-based method, the resulting datasets don’t appear to be of high quality.
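
For reference, pulling the (message, diff) pairs out of Git can be done roughly like this (a sketch using plain git via subprocess; the repo path and commit limit are illustrative, and merge commits or huge diffs would still need filtering):

```python
import subprocess

def iter_commit_samples(repo="engine/", max_commits=500):
    """Yield (commit message, patch) pairs to feed into the QA-generation prompt."""
    shas = subprocess.run(
        ["git", "-C", repo, "log", f"-{max_commits}", "--pretty=%H"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for sha in shas:
        message = subprocess.run(
            ["git", "-C", repo, "show", "-s", "--pretty=%B", sha],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        diff = subprocess.run(
            ["git", "-C", repo, "show", "--pretty=format:", sha],
            capture_output=True, text=True, errors="replace", check=True,
        ).stdout
        yield {"message": message, "diff": diff}
```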

I’ve tried pre-training on the private codebase before fine-tuning, but this didn’t yield noticeable improvements either.

From what I’ve seen in the community, most discussions focus on dataset creation from documents. There’s far less guidance on dataset generation from codebases—especially large, complex, multi-language ones. Most existing practices seem geared toward code pre-training or FIM (Fill In the Middle) tasks.
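
For what it's worth, those pre-training/FIM pipelines sidestep QA generation entirely: a fill-in-the-middle sample is just a re-ordered slice of raw code (the sentinel tokens below follow the Qwen2.5-Coder convention and differ between model families):

```python
# A single FIM pre-training sample: split raw code and re-order it with
# sentinel tokens so the model learns to predict the missing middle.
def make_fim_sample(code: str, start: int, end: int) -> str:
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"
```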

I’m not sure whether my current methodology is flawed or if I need to significantly revise my approach. Any guidance would be greatly appreciated:

  • Is my current approach to dataset construction fundamentally sound?
  • Are there improvements I should make to the prompt or generation method?
  • Why might the fine-tuned model be performing so poorly despite the seemingly relevant dataset?

I’m continuing to experiment and will update here with any new progress. Thanks in advance for any help.


Let's set aside dataset creation for now. It's possible that the Qwen 3 weights or their Transformers implementation are not yet fully mature.

A quick way to test the suitability of the dataset would be to train a very small, reasonably well-established model on it and see whether that produces sensible results.

Thank you for your response. I apologize for not explaining this clearly earlier. Currently, I only use Qwen3-235B for data generation, while the fine-tuning is based on the very small Qwen2.5 model. It's still just experimentation on very small models, and I won't apply it to larger models until there are good results.


I think that’s a good approach. In that case, it might be an issue with the dataset or the Trainer settings themselves…:thinking:

It might be a good idea to ask on Hugging Face’s Discord or unsloth’s Discord.
Even if it’s not documented in articles or documentation, it’s common for people who know the ins and outs to be willing to share their expertise.

If the fine-tuning know-how is publicly available online, we can quickly summarize it using ChatGPT or Gemini, but for things that aren’t publicly available, they’re not very helpful…


by HuggingChat:

Summary of Key Recommendations

Your model’s poor performance stems from fragmented context and low-quality QA pairs in training data. Here’s how to fix it:


1. Core Problems

  • Isolated Code Files: Training on single files prevents systemic understanding (e.g., API interactions, architecture).
  • Shallow Questions: Generated QA pairs focus on syntax rather than real tasks (debugging, optimization).
  • Self-Supervised Flaws: Using a model to generate its own training data compounds errors.
  • Empty Input Fields: Model isn’t trained to use external context (e.g., code diffs, error logs).

2. Critical Fixes

  • Context-Aware QA:
    • Include related code snippets (e.g., parent/child classes, API consumers) in training data.
    • Use static analysis tools (e.g., Clang, language servers) to map dependencies.
  • Task-Driven Questions:
    • Prioritize high-value tasks:
      • Code generation (e.g., “Implement X system”).
      • Debugging (e.g., “Why does Y crash?”).
      • Optimization (e.g., “Reduce memory usage in Z”).
    • Force answers to include implementation-heavy code snippets.
  • RAG-Augmented Training (a minimal sketch follows this list):
    • Build a codebase vector database (function-level chunks, API docs, Git history).
    • Provide retrieved context during training (e.g., show relevant code when answering questions).
  • Data Quality Control:
    • Validate code snippets with static analysis/compiler checks.
    • Add human feedback to filter low-quality QA pairs.
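
A minimal sketch of the RAG-augmented idea above (the sentence-transformers embedding model, FAISS, and function-level chunking are assumptions about one possible stack, not a prescription):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any local code-aware embedder would do.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    """chunks: function/class-level snippets extracted from the codebase."""
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])   # cosine similarity via inner product
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def retrieve(index, chunks: list[str], question: str, k: int = 4) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]
```

The retrieved snippets can then be placed into the Alpaca `input` field so the model learns to answer with context rather than from memorization alone.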

3. Training Adjustments

  • Curriculum Learning:
    • Phase 1: Syntax-level tasks (API usage, code interpretation).
    • Phase 2: Architecture/design questions.
    • Phase 3: Optimization/debugging scenarios.
  • Loss Masking (a minimal sketch follows this list):
    • Exclude comments/padding from loss calculation.
  • Evaluation Metrics:
    • Measure execution accuracy (does the code compile?), context recall (does it reference relevant code?), and design pattern recognition.
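
A sketch of the prompt/padding side of the loss-masking point above (assumes an empty Alpaca `input` field, a simple instruction/response template, and a small Qwen2.5 base; masking out code comments would need extra token-level bookkeeping):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")  # assumed base

def tokenize_masked(example, max_len=2048):
    """Mask the prompt so loss is only computed on the answer tokens."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    full = prompt + example["output"] + tokenizer.eos_token
    ids = tokenizer(full, truncation=True, max_length=max_len)["input_ids"]
    prompt_len = min(len(tokenizer(prompt)["input_ids"]), len(ids))
    labels = list(ids)
    labels[:prompt_len] = [-100] * prompt_len      # -100 is ignored by the loss
    return {"input_ids": ids, "labels": labels}

# Padding added later should also be labelled -100, e.g. via
# DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100).
```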

4. Implementation Roadmap

  1. Weeks 1-2: Map code dependencies and generate cross-file QA pairs.
  2. Weeks 3-4: Redesign prompts for task-driven QA and validate data.
  3. Weeks 5-6: Implement RAG-augmented training with retrieval context.
  4. Weeks 7-8: Add human evaluation and custom metrics.

Final Takeaway:
Your current approach lacks systemic understanding and real-task alignment. Shift from isolated file analysis to context-aware, task-specific training (e.g., debugging, optimization) using cross-file code dependencies and RAG. This will enable the model to reason about the entire codebase, not just individual files.


I'll try asking on Discord. I've already tried ChatGPT, Gemini, Grok, and so on, and also used Deep Research, but haven't found a better solution yet. :face_holding_back_tears:
Anyway, thanks for your answer!
