Fine-Tuning LLMs on Large Proprietary Codebases

Fix:

Your current approach (auto-generating Q&A pairs from single files) produces synthetic data with low diversity and context loss.

For better results:

    Include neighboring/related files in context windows (where feasible).

    Add real human-written questions/answers or high-quality curated examples, not just auto-generated ones.

    Mix in Fill-In-the-Middle or code-completion style samples, not only Q&A format.

    Filter generated data for relevance and quality, and balance it with some language-only data (docstrings, commit messages).
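To illustrate the Fill-In-the-Middle suggestion above, here is a minimal sketch of turning a source file into a FIM training sample. The `<|fim_*|>` sentinel names follow a common convention but are an assumption here; substitute whatever special tokens your tokenizer actually defines.

```python
import random

def make_fim_sample(code: str, rng: random.Random) -> str:
    """Split source code into prefix/middle/suffix and emit a
    Fill-In-the-Middle sample in prefix-suffix-middle (PSM) order.

    Sentinel tokens are illustrative; match them to your tokenizer.
    """
    lines = code.splitlines(keepends=True)
    # Choose a contiguous span of lines as the "middle" the model must predict.
    start = rng.randrange(0, len(lines))
    end = rng.randrange(start + 1, len(lines) + 1)
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    # PSM ordering: the model conditions on prefix and suffix,
    # then learns to generate the missing middle.
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"

sample = make_fim_sample("def add(a, b):\n    return a + b\n", random.Random(0))
```

Mixing samples like this with your Q&A data exposes the model to in-file completion, which auto-generated Q&A pairs alone never cover.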

Model performance stays poor because single-file synthetic Q&A lacks real variation and rarely matches the phrasing or context complexity of actual user queries.

Solution provided by Triskel Data Deterministic AI.
