Fine-Tuning LLMs on Large Proprietary Codebases

Fix:

Your current approach (auto-generating Q&A pairs from single files) produces synthetic data with low diversity and context loss.

For better results:

    Include neighboring/related files in context windows (where feasible).

    Add real human-written questions/answers or high-quality curated examples, not just auto-generated ones.

    Mix in Fill-In-the-Middle or code-completion style samples, not only Q&A format.

    Filter generated data for relevance and quality, and balance it with some language-only data (docstrings, commit messages).
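To illustrate the Fill-In-the-Middle suggestion above, here is a minimal sketch of turning a source file into a FIM training sample. The `<|fim_*|>` sentinel names follow a common convention but are an assumption here; substitute whatever special tokens your tokenizer actually defines.

```python
import random

def make_fim_sample(code: str, rng: random.Random) -> str:
    """Split source code into prefix/middle/suffix and emit a
    Fill-In-the-Middle sample in prefix-suffix-middle (PSM) order.

    Sentinel tokens are illustrative; match them to your tokenizer.
    """
    lines = code.splitlines(keepends=True)
    # Choose a contiguous span of lines as the "middle" the model must predict.
    start = rng.randrange(0, len(lines))
    end = rng.randrange(start + 1, len(lines) + 1)
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    # PSM ordering: the model conditions on prefix and suffix,
    # then learns to generate the missing middle.
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"

sample = make_fim_sample("def add(a, b):\n    return a + b\n", random.Random(0))
```

Mixing samples like this with your Q&A data exposes the model to in-file completion, which auto-generated Q&A pairs alone never cover.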

Model performance stays poor because single-file synthetic Q&A lacks real variation and rarely matches the phrasing or context complexity of actual user queries.

Solution provided by Triskel Data Deterministic AI.
