Hello Hugging Face community,
My name is Arda Mülayim, and I’m a final-year computer engineering student from Turkey. I’ve been exploring whether a small LLM like LLaMA 3.2 (3B) can inherit reasoning capabilities from a much larger model—such as Claude 4—by training on structured reflections over its own mistakes. The key idea is to go beyond copying outputs and instead try to transfer reasoning patterns through fine-tuning.
I fine-tuned two models using the CodeXGLUE text-to-code dataset (Java subset), each with 100k training samples. For one model, I applied standard supervised fine-tuning (SFT). For the other, I augmented 10k of the examples with Claude-4-Sonnet-generated feedback that included: (1) where the model’s output went wrong, and (2) what the model could learn from the mistake. These reflections were inserted as additional training prompts.
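To make the data format concrete, here is a simplified sketch of how one reflection-augmented sample can be assembled (field names and prompt wording are illustrative, not the exact ones from my scripts):

```python
# Simplified sketch: turning one mistake plus Claude's feedback into an extra training prompt.
# Field names and prompt wording are illustrative, not the exact ones from my scripts.
def build_reflection_sample(nl_description, reference_code, model_output, feedback):
    prompt = (
        "Task: generate Java code for the following description.\n"
        f"{nl_description}\n\n"
        "A previous attempt produced this code:\n"
        f"{model_output}\n\n"
        "Reflection on the attempt:\n"
        f"Where it went wrong: {feedback['where_it_went_wrong']}\n"
        f"What to learn: {feedback['what_to_learn']}\n\n"
        "Corrected solution:\n"
    )
    return {"text": prompt + reference_code}
```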
The two models are:
- One trained with 100% standard SFT
- One trained with 90% SFT + 10% reflection-based samples
To evaluate performance, I selected 100 held-out Java code generation tasks and asked Claude-4-Sonnet to rate each output. The results were:
- SFT model: average score 60.66 / 100, std 30.98, validation loss 0.6012
- Reflection model: average score 63.27 / 100, std 30.76, validation loss 0.5770 (SFT loss) / 0.3945 (meta loss)
The reflection-based model achieved a +4.3% relative gain in average preference score with slightly lower variance, suggesting better generalization and internalization of error patterns.
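For anyone who wants to reproduce the evaluation, it is essentially an LLM-as-judge loop; a condensed sketch is below (the rubric prompt, score parsing, and judge model name are placeholders rather than my exact script):

```python
import re
import statistics
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_score(task_description, generated_code):
    """Ask the judge model to rate one generation on a 0-100 scale (simplified rubric)."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder judge model id
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": (
                f"Task description:\n{task_description}\n\n"
                f"Generated Java code:\n{generated_code}\n\n"
                "Rate the code from 0 to 100 for correctness and completeness. "
                "Reply with only the number."
            ),
        }],
    )
    match = re.search(r"\d+", response.content[0].text)
    return int(match.group()) if match else 0

# scores = [judge_score(task, output) for task, output in zip(tasks, outputs)]
# print(statistics.mean(scores), statistics.pstdev(scores))
```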
In interactive testing, the difference becomes even more noticeable. When I feed faulty Java code to each model in a chat session and ask it to “analyze the mistake”, the reflection model tends to give coherent, grounded answers that are often partially or even mostly correct. The SFT model, in contrast, frequently hallucinates details or misinterprets the logic entirely.
I also tested a 3-stage pipeline: (1) initial code generation, (2) self-critique, and (3) revised generation. While the reflection model performs significantly better in the analysis step, the final regeneration step often produces code very similar to the original, with only minor safety improvements like null checks. This suggests that while the model is learning to reason about code, its willingness to revise outputs may still be limited.
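For reference, the pipeline itself is just three `generate` calls chained through the chat template; a minimal sketch (the model id and prompt wording are placeholders for my fine-tuned checkpoints) looks like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of the generate -> critique -> regenerate loop.
# The model id and prompt wording are placeholders, not my exact configuration.
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(messages, max_new_tokens=512):
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

task = "Write a Java method that reverses a singly linked list."

# Stage 1: initial code generation
code_v1 = chat([{"role": "user", "content": task}])

# Stage 2: self-critique of the first attempt
critique = chat([{"role": "user", "content":
    f"{task}\n\nHere is an attempt:\n{code_v1}\n\nAnalyze the mistakes in this code."}])

# Stage 3: revised generation conditioned on the critique
code_v2 = chat([{"role": "user", "content":
    f"{task}\n\nPrevious attempt:\n{code_v1}\n\nCritique:\n{critique}\n\nWrite an improved version."}])
```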
All training code, logs, and configuration files are available here:
And all models and datasets can be found on my Hugging Face profile:
I’d love feedback from the Hugging Face community on a few questions:
- Does this setup effectively test reasoning transfer? What additional benchmarks or tasks would help validate the idea?
- How can I evaluate “analysis accuracy” more rigorously, beyond Claude preference scoring?
- Would it be possible or valuable to integrate this kind of reflection-based fine-tuning into existing Hugging Face examples or trainer workflows? (I sketch one possible integration below this list.)
- How can I best structure this project to be fully reproducible and reusable by others?
- I’m also hoping to contribute more to Hugging Face open-source efforts. Are there specific libraries or areas where this type of work could be useful?
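On the trainer-integration question above: since reflection samples are just extra text examples, they can, as far as I can tell, be mixed into a standard TRL `SFTTrainer` run without custom code. A rough sketch (dataset paths, model id, and hyperparameters are placeholders, not my exact configuration):

```python
from datasets import load_dataset, concatenate_datasets
from trl import SFTConfig, SFTTrainer

# Rough sketch: mixing plain SFT samples with reflection samples (both with a "text" column),
# then training with the standard TRL SFTTrainer. Paths, model id, and hyperparameters
# are placeholders, not my exact configuration.
sft_data = load_dataset("json", data_files="sft_samples.jsonl", split="train")
reflection_data = load_dataset("json", data_files="reflection_samples.jsonl", split="train")
train_dataset = concatenate_datasets([sft_data, reflection_data]).shuffle(seed=42)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",  # recent TRL versions accept a model id string here
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="llama3.2-3b-reflection-sft",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
)
trainer.train()
```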
Thanks for taking the time to read. I would truly appreciate any guidance, critique, or suggestions you might have.
Best regards,
Arda Mülayim