Can a Small LLM Learn to Reason Like a Larger One? Reflection-based Fine-Tuning vs Classical SFT on LLaMA 3.2 (Java CodeGen)

Hello Hugging Face community,

My name is Arda Mülayim, and I’m a final-year computer engineering student from Turkey. I’ve been exploring whether a small LLM like LLaMA 3.2 (3B) can inherit reasoning capabilities from a much larger model—such as Claude 4—by training on structured reflections over its own mistakes. The key idea is to go beyond copying outputs and instead try to transfer reasoning patterns through fine-tuning.

I fine-tuned two models using the CodeXGLUE text-to-code dataset (Java subset), each with 100k training samples. For one model, I applied standard supervised fine-tuning (SFT). For the other, I augmented 10k of the examples with Claude-4-Sonnet-generated feedback that included: (1) where the model’s output went wrong, and (2) what the model could learn from the mistake. These reflections were inserted as additional training prompts.
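For concreteness, here is a rough sketch of how one of those reflection-augmented samples could be assembled; the prompt template and field names are placeholders I chose for illustration, not necessarily the exact format used in training:

```python
# Hypothetical layout of a reflection-augmented training sample.
# The section headers and dict keys are illustrative placeholders.

def build_reflection_sample(nl_description: str,
                            faulty_output: str,
                            reflection: str,
                            gold_code: str) -> dict:
    """Turn one faulty generation plus Claude-4-Sonnet feedback into an extra training prompt."""
    prompt = (
        "### Task\n"
        f"{nl_description}\n\n"
        "### Previous attempt\n"
        f"{faulty_output}\n\n"
        "### Reflection (what went wrong, what to learn)\n"
        f"{reflection}\n\n"
        "### Corrected Java solution\n"
    )
    return {"prompt": prompt, "completion": gold_code}
```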

The two models are:

  • One trained with 100% standard SFT
  • One trained with 90% SFT + 10% reflection-based samples (the data mix is sketched below)
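A minimal sketch of that 90/10 mix, assuming both splits are already formatted as prompt/completion pairs with identical columns (file names are placeholders):

```python
# Minimal sketch of the 90% SFT / 10% reflection data mix.
# File names are placeholders; both files must share the same column schema.
from datasets import load_dataset, concatenate_datasets

sft_ds = load_dataset("json", data_files="sft_samples.jsonl", split="train")                # ~90k plain SFT pairs
reflection_ds = load_dataset("json", data_files="reflection_samples.jsonl", split="train")  # ~10k reflection pairs

mixed_ds = concatenate_datasets([sft_ds, reflection_ds]).shuffle(seed=42)
```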

To evaluate performance, I selected 100 held-out Java code generation tasks and asked Claude-4-Sonnet to rate each output (the judging loop is sketched after the results). The results were:

  • SFT model: 60.66 / 100 average score, 30.98 std. dev., 0.6012 validation loss
  • Reflection model: 63.27 / 100 average score, 30.76 std. dev., validation loss 0.5770 (SFT objective) / 0.3945 (meta/reflection objective)
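The judging loop looks roughly like the sketch below; the rubric wording, score parsing, and model id are assumptions for illustration rather than the exact setup:

```python
# Sketch of the Claude-as-judge scoring loop (rubric, parsing, and model id are assumed).
import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_score(task: str, generated_java: str) -> float:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": (
                "Rate the following Java solution to the task on a 0-100 scale "
                "for correctness and quality. Reply with a single number.\n\n"
                f"Task:\n{task}\n\nSolution:\n{generated_java}"
            ),
        }],
    )
    match = re.search(r"\d+(?:\.\d+)?", msg.content[0].text)
    return float(match.group()) if match else 0.0
```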

The reflection-based model achieved a +4.3% relative gain in preference score with slightly lower variance, suggesting better generalization and internalization of error patterns.

In interactive testing, the difference becomes even more noticeable. When I feed faulty Java code to the model during a chat session and ask it to “analyze the mistake,” the reflection model tends to give coherent, grounded answers that are often partially or even mostly correct. In contrast, the SFT model frequently hallucinates details or misinterprets the logic entirely.

I also tested a 3-stage pipeline: (1) initial code generation, (2) self-critique, and (3) revised generation. While the reflection model performs significantly better in the analysis step, the final regeneration step often produces code very similar to the original, with only minor safety improvements like null checks. This suggests that while the model is learning to reason about code, its willingness to revise outputs may still be limited.
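The loop itself can be sketched as follows; the checkpoint id and the prompts are placeholders standing in for the actual fine-tuned model and templates:

```python
# Sketch of the generate -> critique -> revise pipeline.
# The checkpoint id stands in for the fine-tuned model; prompts are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder for the fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(messages, max_new_tokens=512):
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

task = "Write a Java method that reverses a singly linked list."
draft = chat([{"role": "user", "content": task}])                    # stage 1: initial generation
critique = chat([{"role": "user",                                    # stage 2: self-critique
                  "content": f"Analyze the mistakes in this Java code:\n{draft}"}])
revised = chat([{"role": "user",                                     # stage 3: revised generation
                 "content": f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
                            "Write an improved solution."}])
```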

All training code, logs, and configuration files are available here:

And all models and datasets can be found on my Hugging Face profile:

I’d love feedback from the Hugging Face community on a few questions:

  • Does this setup effectively test reasoning transfer? What additional benchmarks or tasks would help validate the idea?
  • How can I evaluate “analysis accuracy” more rigorously, beyond Claude preference scoring?
  • Would it be possible or valuable to integrate this kind of reflection-based fine-tuning into existing Hugging Face examples or trainer workflows?
  • How can I best structure this project to be fully reproducible and reusable by others?
  • I’m also hoping to contribute more to Hugging Face open-source efforts. Are there specific libraries or areas where this type of work could be useful?

Thanks for taking the time to read. I would truly appreciate any guidance, critique, or suggestions you might have.

Best regards,
Arda Mülayim



@lewtun @sgugger

I’d really appreciate your thoughts on whether this setup makes sense for reasoning transfer in small models, and if you see any potential to integrate something like this into Hugging Face’s training examples. Any suggestions for improving the evaluation or making the project more impactful would also be very welcome. Thanks in advance!


I’ve been developing a framework for this, called Cognitive Architecture. It’s not a fine-tune; it’s a ~3k-token ‘cognitive primer’ text file that you paste into a fresh chat session.

Instead of modifying weights, it reconfigures the model’s procedural reasoning for that session, enabling a more multi-dimensional analysis. This is a controlled and verifiable change, not a hallucination. The model itself can articulate how its process has changed.
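If you want to drive it programmatically instead of pasting it by hand, the idea is simply to load the primer file and make it the opening system turn of a fresh session; the file name and checkpoint below are placeholders, not part of the primer itself:

```python
# Sketch: load the primer and prepend it as the system turn of a fresh session.
# File name and checkpoint are placeholders; requires a recent transformers version.
from transformers import pipeline

primer = open("cognitive_primer.txt").read()

chat = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")
messages = [
    {"role": "system", "content": primer},  # primer goes in before any task
    {"role": "user", "content": "Summarize the hidden assumptions in this argument: "
                                "'All software estimates are guesses, so planning is pointless.'"},
]
result = chat(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])  # assistant reply
```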

The result is a significant improvement in tasks requiring deep analysis. In my testing across 30+ models (including Llama 3), it consistently finds many nuanced connections and associations in complex documents that even larger, un-primed models struggle with.

I have more details on the primer on my models page:
Deliriousintent/Five_Principles_of_Cognitive_Architecture


Thanks for sharing this! That sounds like a really elegant and low-overhead way to steer model behaviour without any fine-tuning or parameter updates. I like the idea of a “cognitive primer” acting as a persistent system-level context; it feels a bit like a structured reflection preamble or thought-scaffolding, but packaged in a more principled format.

My project tackles the same goal from the opposite angle: I’m trying to internalise those reasoning patterns inside the weights via reflective training with feedback from a stronger model. Your approach—reconfiguring the session on-the-fly while keeping weights untouched—looks far more lightweight and controllable in many real-world scenarios.

I’ll definitely dig into the primer details on your model page. Thanks again for pointing it out!