Hello Hugging Face community,
My name is Arda Mülayim, and I’m a final-year computer engineering student from Turkey. I’ve been exploring whether a small LLM like LLaMA 3.2 (3B) can inherit reasoning capabilities from a much larger model—such as Claude 4—by training on structured reflections over its own mistakes. The key idea is to go beyond copying outputs and instead try to transfer reasoning patterns through fine-tuning.
I fine-tuned two models using the CodeXGLUE text-to-code dataset (Java subset), each with 100k training samples. For one model, I applied standard supervised fine-tuning (SFT). For the other, I augmented 10k of the examples with Claude-4-Sonnet-generated feedback that included: (1) where the model’s output went wrong, and (2) what the model could learn from the mistake. These reflections were inserted as additional training prompts.
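To make the data format concrete, here is a simplified sketch of how one reflection-augmented sample can be assembled (field names and prompt wording are illustrative, not the exact ones from my scripts):

```python
# Simplified sketch: turning one mistake plus Claude's feedback into an extra training prompt.
# Field names and prompt wording are illustrative, not the exact ones from my scripts.
def build_reflection_sample(nl_description, reference_code, model_output, feedback):
    prompt = (
        "Task: generate Java code for the following description.\n"
        f"{nl_description}\n\n"
        "A previous attempt produced this code:\n"
        f"{model_output}\n\n"
        "Reflection on the attempt:\n"
        f"Where it went wrong: {feedback['where_it_went_wrong']}\n"
        f"What to learn: {feedback['what_to_learn']}\n\n"
        "Corrected solution:\n"
    )
    return {"text": prompt + reference_code}
```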
The two models are:
- One trained with 100% standard SFT
- One trained with 90% SFT + 10% reflection-based samples
To evaluate performance, I selected 100 held-out Java code generation tasks and asked Claude-4-Sonnet to rate each output. The results were:
- SFT model: average score 60.66 / 100, std 30.98, validation loss 0.6012
- Reflection model: average score 63.27 / 100, std 30.76, validation loss 0.5770 (SFT loss) / 0.3945 (meta loss)
The reflection-based model achieved a +4.3% relative gain in average preference score with slightly lower variance, suggesting better generalization and internalization of error patterns.
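For anyone who wants to reproduce the evaluation, it is essentially an LLM-as-judge loop; a condensed sketch is below (the rubric prompt, score parsing, and judge model name are placeholders rather than my exact script):

```python
import re
import statistics
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_score(task_description, generated_code):
    """Ask the judge model to rate one generation on a 0-100 scale (simplified rubric)."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder judge model id
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": (
                f"Task description:\n{task_description}\n\n"
                f"Generated Java code:\n{generated_code}\n\n"
                "Rate the code from 0 to 100 for correctness and completeness. "
                "Reply with only the number."
            ),
        }],
    )
    match = re.search(r"\d+", response.content[0].text)
    return int(match.group()) if match else 0

# scores = [judge_score(task, output) for task, output in zip(tasks, outputs)]
# print(statistics.mean(scores), statistics.pstdev(scores))
```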
In interactive testing, the difference becomes even more noticeable. When I feed faulty Java code to each model in a chat session and ask it to “analyze the mistake”, the reflection model tends to give coherent, grounded answers that are often partially or even mostly correct. The SFT model, in contrast, frequently hallucinates details or misinterprets the logic entirely.
I also tested a 3-stage pipeline: (1) initial code generation, (2) self-critique, and (3) revised generation. While the reflection model performs significantly better in the analysis step, the final regeneration step often produces code very similar to the original, with only minor safety improvements like null checks. This suggests that while the model is learning to reason about code, its willingness to revise outputs may still be limited.
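For reference, the pipeline itself is just three `generate` calls chained through the chat template; a minimal sketch (the model id and prompt wording are placeholders for my fine-tuned checkpoints) looks like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of the generate -> critique -> regenerate loop.
# The model id and prompt wording are placeholders, not my exact configuration.
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(messages, max_new_tokens=512):
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

task = "Write a Java method that reverses a singly linked list."

# Stage 1: initial code generation
code_v1 = chat([{"role": "user", "content": task}])

# Stage 2: self-critique of the first attempt
critique = chat([{"role": "user", "content":
    f"{task}\n\nHere is an attempt:\n{code_v1}\n\nAnalyze the mistakes in this code."}])

# Stage 3: revised generation conditioned on the critique
code_v2 = chat([{"role": "user", "content":
    f"{task}\n\nPrevious attempt:\n{code_v1}\n\nCritique:\n{critique}\n\nWrite an improved version."}])
```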
All training code, logs, and configuration files are available here:
And all models and datasets can be found on my Hugging Face profile:
I’d love feedback from the Hugging Face community on a few questions:
- Does this setup effectively test reasoning transfer? What additional benchmarks or tasks would help validate the idea?
- How can I evaluate “analysis accuracy” more rigorously, beyond Claude preference scoring?
- Would it be possible or valuable to integrate this kind of reflection-based fine-tuning into existing Hugging Face examples or trainer workflows? (I sketch one possible integration below this list.)
- How can I best structure this project to be fully reproducible and reusable by others?
- I’m also hoping to contribute more to Hugging Face open-source efforts. Are there specific libraries or areas where this type of work could be useful?
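On the trainer-integration question above: since reflection samples are just extra text examples, they can, as far as I can tell, be mixed into a standard TRL `SFTTrainer` run without custom code. A rough sketch (dataset paths, model id, and hyperparameters are placeholders, not my exact configuration):

```python
from datasets import load_dataset, concatenate_datasets
from trl import SFTConfig, SFTTrainer

# Rough sketch: mixing plain SFT samples with reflection samples (both with a "text" column),
# then training with the standard TRL SFTTrainer. Paths, model id, and hyperparameters
# are placeholders, not my exact configuration.
sft_data = load_dataset("json", data_files="sft_samples.jsonl", split="train")
reflection_data = load_dataset("json", data_files="reflection_samples.jsonl", split="train")
train_dataset = concatenate_datasets([sft_data, reflection_data]).shuffle(seed=42)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",  # recent TRL versions accept a model id string here
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="llama3.2-3b-reflection-sft",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
)
trainer.train()
```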
Thanks for taking the time to read. I would truly appreciate any guidance, critique, or suggestions you might have.
Best regards,
Arda Mülayim