Fix:
Your current approach (auto-generating Q&A pairs from single files in isolation) produces synthetic data with low diversity and loses cross-file context.
For better results:
Include neighboring/related files in context windows (where feasible).
Add real human-written questions/answers or high-quality curated examples, not just auto-generated ones.
Mix in Fill-In-the-Middle (FIM) or code-completion-style samples, not only Q&A pairs.
Filter generated data for relevance and quality, and balance it with some language-only data (docstrings, commit messages).
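The first point (neighboring files in the context window) can be sketched as follows. This is a minimal illustration assuming Python sources and a simple same-directory heuristic; `build_context` and its parameters are hypothetical names, not from any specific library:

```python
from pathlib import Path

def build_context(target: Path, max_neighbors: int = 3, max_chars: int = 8000) -> str:
    """Prepend up to max_neighbors sibling files to the target file's content.
    Heuristic assumption: files sharing a directory are likely related."""
    parts = []
    neighbors = [p for p in sorted(target.parent.glob("*.py")) if p != target]
    for p in neighbors[:max_neighbors]:
        parts.append(f"# --- neighbor: {p.name} ---\n{p.read_text()}")
    parts.append(f"# --- target: {target.name} ---\n{target.read_text()}")
    # Truncate to a rough character budget so the sample fits the context window.
    return "\n\n".join(parts)[:max_chars]
```

A real pipeline would rank neighbors by import relationships or embedding similarity rather than directory adjacency, but even this crude version gives the model cross-file signal that single-file sampling discards.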
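For the FIM suggestion, a sketch of turning a code snippet into a prefix/suffix/middle training string. The sentinel token names below follow the common PSM layout but are illustrative; match whatever special tokens your tokenizer defines:

```python
import random

def to_fim(code: str, rng: random.Random) -> str:
    """Split code at two random points and emit a Fill-In-the-Middle
    training string: the model sees prefix + suffix and predicts the middle."""
    if len(code) < 3:
        return code  # too short to split into three non-empty parts
    i, j = sorted(rng.sample(range(1, len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"
```

Applying this to a fraction of your corpus (rather than reformatting everything) is the usual way to mix formats without losing ordinary left-to-right samples.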
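And for the filtering step, a minimal heuristic sketch for dropping low-quality generated Q&A pairs. The thresholds and checks are assumptions to illustrate the idea, not a validated filter:

```python
def keep_sample(question: str, answer: str) -> bool:
    """Hypothetical quality filter for generated Q&A pairs: drop samples
    that are too short, look truncated, or contain refusal boilerplate."""
    if len(question.split()) < 4 or len(answer.split()) < 8:
        return False  # too short to carry information
    if not answer.rstrip().endswith((".", "`", ")", '"')):
        return False  # likely truncated mid-sentence by the generator
    if "as an ai" in answer.lower():
        return False  # refusal/boilerplate leakage from the generating model
    return True
```

In practice you would add deduplication and a relevance check against the source file (e.g. does the answer mention identifiers that actually appear in the code), but even cheap filters like this remove a surprising amount of noise.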
Model performance is poor because auto-generated Q&A lacks genuine variation and often fails to match the phrasing and context complexity of real user queries.
Solution provided by Triskel Data Deterministic AI.