Inverse correlation during in-training evaluation: low token accuracy but high IFEval accuracy, with reversed results in post-training evaluation

I have a bit of a puzzle here and would be happy to hear from the knowledgeable people around here.

I’m examining the effect of training a LoRA adapter with different percentages of additional replay-buffer data, and I evaluate both during and after training using the IFEval strict prompt-level metric. I understand that the generation flow is completely different between in-training and post-training evaluation, but the behavior I’m seeing is not what I expected. Explanations, ideally with experiments I can run to verify them, would be highly appreciated.

I’m using SFTConfig and SFTTrainer from the trl package for all configuration and experiments. Overall I ran 4 training runs with 0%, 5%, 18% and 50% replay buffer (an X% addition to the original dataset in terms of training examples); the replay buffer was taken from the SMOL2 dataset.

I see two phenomena which I’m puzzled about:

  1. Low token accuracy is correlated with higher IFEval accuracy. I anticipated that lower token accuracy would mean worse generation quality and more instruction-following errors, but in reality the 5% buffer run has significantly lower token accuracy yet higher instruction-following accuracy.
  2. No correlation between the in-training and post-training IFEval accuracy. In the post-training IFEval evaluation the 50% buffer reached the best performance (0.675), the 5% and 18% buffers scored lower (0.66), and the no-buffer run was lowest (0.61). While the post-training results make sense, they are not correlated with the in-training accuracy: the no-buffer run is lowest on all tests, but the top-performing variant is not the same in-training and post-training.

Adding the W&B monitoring for reference.


I ran some tests on Colab. I wasn’t able to reproduce the phenomenon entirely, but I was able to reproduce it partially.


What you are seeing is coherent. It is not a paradox. It is a case where the training-time metric and the evaluation-time metric are measuring different abilities, and replay is changing the balance between them. TRL’s SFTTrainer is still optimizing next-token prediction, and for prompt-completion data it computes loss on the completion tokens only by default. IFEval strict, by contrast, scores whether a freely generated answer satisfies all verifiable instructions in a prompt, over a 541-prompt benchmark with 25 instruction types. Those two objectives can disagree sharply. (Hugging Face)

The background

What mean_token_accuracy is really measuring

In TRL SFT, mean_token_accuracy is a teacher-forced top-1 token match on labeled tokens. In plain terms, the model sees the gold prefix and is rewarded for predicting the exact next target token. For prompt-completion datasets, the trainer computes loss on the completion only unless you disable completion_only_loss. That makes the metric mainly a measure of reference imitation, not of free-generation behavior. (Hugging Face)
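To make that concrete, here is a minimal sketch of what a teacher-forced top-1 token accuracy looks like. This is my own illustration, not TRL’s actual implementation; the -100 label convention for masked (prompt/padding) positions matches the usual transformers/TRL collators.

```python
import torch

def teacher_forced_token_accuracy(logits, labels, ignore_index=-100):
    """Top-1 accuracy of next-token predictions on labeled (completion) tokens.

    logits: (batch, seq_len, vocab) from a forward pass over the gold sequence.
    labels: (batch, seq_len) with prompt/padding positions set to ignore_index.
    """
    # Shift so that position t predicts token t+1, mirroring the causal-LM loss.
    preds = logits[:, :-1, :].argmax(dim=-1)
    targets = labels[:, 1:]
    mask = targets != ignore_index              # only completion tokens count
    correct = (preds == targets) & mask
    return correct.sum().float() / mask.sum().clamp(min=1).float()
```

Note that the model always sees the gold prefix here; nothing in this number reflects what the model would do when generating freely.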

What IFEval strict is really measuring

IFEval strict is a prompt-level all-constraints-pass metric. A prompt is counted as correct only if every verifiable instruction is followed. The benchmark was built around things like required keywords, exact bullet counts, formatting, case changes, punctuation, start/end constraints, and length constraints. The paper also defines a loose version because strict scoring is brittle to harmless details such as markdown markers, a leading line like “Sure, here it is:”, or a trailing line like “Hope it helps.” (arXiv)
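As a toy illustration of the strict prompt-level aggregation (my own sketch, not the official ifeval code), assume you already have per-instruction booleans for each prompt:

```python
def prompt_level_strict_accuracy(results):
    """results: one inner list per prompt, one boolean per verifiable instruction."""
    # A prompt counts only if *every* instruction it contains is satisfied.
    passed = sum(all(instr_ok for instr_ok in prompt) for prompt in results)
    return passed / len(results)

# Example: 3 prompts, the second fails one of its two constraints -> 2/3.
print(prompt_level_strict_accuracy([[True], [True, False], [True, True]]))
```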

That already explains most of your first phenomenon. A model can be worse at reproducing one labeled completion token-by-token, yet better at satisfying the benchmark’s explicit output constraints. (Hugging Face)

Why lower token accuracy can come with higher IFEval

The key is one reference vs many valid outputs.

Teacher-forced token accuracy assumes there is one preferred continuation. IFEval does not. If the prompt says “write exactly four bullet points” or “end with a postscript starting with P.S.” then there are many valid answers. A model can generate a very different answer from the reference and still satisfy the prompt perfectly. In that case, token accuracy falls while IFEval rises. That is exactly the kind of separation the benchmark was designed to expose. (arXiv)
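A toy example of the one-reference-vs-many-valid-outputs point, with a made-up prompt and answers: both answers satisfy an “exactly four bullet points” check, yet they share almost no tokens, so teacher-forced accuracy against the reference would punish the second one.

```python
def has_exactly_n_bullets(text, n=4):
    # Count bullet-style lines, the kind of verifiable check IFEval relies on.
    return sum(line.lstrip().startswith(("-", "*")) for line in text.splitlines()) == n

reference = "- apples\n- bananas\n- cherries\n- dates"
alternative = "* tea\n* coffee\n* water\n* juice"

print(has_exactly_n_bullets(reference))    # True
print(has_exactly_n_bullets(alternative))  # True, despite ~zero token overlap with the reference
```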

Your replay source makes this even more likely. The SmolTalk dataset card says the smol-constraints subset trains models to follow explicit constraints such as fixed numbers of sentences or words and required words in the output, and that it was decontaminated against IFEval. That means your replay is not generic replay. It is replay with training signal that overlaps strongly with the type of behavior IFEval rewards. (Hugging Face)

So a result like “5% replay has worse token accuracy but better IFEval” is not odd. A small amount of replay can push the model away from the exact reference completions while improving structural obedience to prompts. (Hugging Face)

Why in-training and post-training IFEval can disagree

There are two broad reasons.

1. Real learning-dynamics differences

Replay changes the effective training objective. With no replay, the adapter can fit the new dataset more aggressively, but it can also forget prior instruction-following habits more aggressively. With more replay, the adapter may preserve those habits better, but adapt more slowly to the new distribution. That means one replay ratio can look best early, while another looks best at the end. Your pattern of 5% looking stronger in one phase and 50% winning the final evaluation fits that logic. (Hugging Face)

2. Evaluation-pipeline differences

This is the part I would be most careful about. Public TRL issue reports show that users trying to compute generation-based metrics inside SFTTrainer ran into the fact that compute_metrics receives logits rather than generations, and that predict_with_generate did not behave the way they expected in the SFT path. Another issue shows a masking pitfall: when using completion-only collation, labels can still be present in input_ids, which matters if you try to do generation-based evaluation through the trainer. So “in-training IFEval” and “external post-training IFEval” may not actually be the same measurement path. (GitHub)
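To see why trainer-internal “IFEval-style” metrics are risky, note what compute_metrics actually receives in the standard Trainer/SFTTrainer path: an EvalPrediction whose predictions are, by default, the teacher-forced logits, not generated text. A hedged sketch of the distinction, assuming no preprocess_logits_for_metrics or custom generation hook is in play:

```python
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred.predictions: logits of shape (batch, seq_len, vocab_size)
    # eval_pred.label_ids:   labels of shape (batch, seq_len), -100 on masked positions
    logits, labels = eval_pred.predictions, eval_pred.label_ids

    # argmax over the vocab gives the teacher-forced next-token guess at every
    # position -- it is NOT a free-running generation, so any "did the model
    # follow the instruction" check computed from it is not measuring IFEval.
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100
    acc = (preds[:, :-1] == labels[:, 1:])[mask[:, 1:]].mean()
    return {"teacher_forced_token_acc": float(acc)}
```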

IFEval is also sensitive to generation settings. Hugging Face’s generation docs note that max_new_tokens, EOS handling, beam settings, and sampling settings directly affect the generated output. Since IFEval strict is sensitive to output format and truncation, a small change in generation config can move the score. (Hugging Face)
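One way to remove that variable is to pin the decoding settings once and reuse the saved file for every evaluation. A sketch; the specific values such as max_new_tokens=1280 are placeholders, not recommendations:

```python
from transformers import GenerationConfig

# Pin the decoding behaviour once, so no checkpoint is evaluated with
# silently different EOS/truncation/sampling settings.
gen_cfg = GenerationConfig(
    max_new_tokens=1280,   # placeholder; long enough that length-constrained answers aren't cut off
    do_sample=False,       # greedy decoding removes sampling variance from the comparison
    num_beams=1,
)
gen_cfg.save_pretrained("ifeval_gen_config")   # writes generation_config.json
# later, in every evaluation script:
# gen_cfg = GenerationConfig.from_pretrained("ifeval_gen_config")
```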

There is one more practical source of drift: LoRA loading. A PEFT issue documents cases where merged and unmerged LoRA inference produced different evaluation outputs even though users expected them to be identical. So if your post-training evaluation loads adapters differently from the in-training path, that can also contribute to ranking reversals. (GitHub)

My read of your two observations

Observation 1: lower token accuracy, higher IFEval

My interpretation is:

  • token accuracy is telling you how well the model imitates the training target continuation under teacher forcing,
  • IFEval is telling you how well the model obeys free-generation output constraints,
  • your replay data is especially good at teaching those constraints.

So the anti-correlation is plausible and, in your setup, expected. The replay is likely improving constraint compliance, not necessarily reference imitation. (Hugging Face)

Observation 2: in-training and post-training IFEval rank runs differently

My interpretation is:

  • some of the difference is probably real, because replay changes the adaptation-vs-retention trajectory,
  • some of the difference may be an artifact, because trainer-internal generation-style evaluation is not equivalent to external benchmark evaluation.

So I would not trust “in-training IFEval” as a model-selection metric unless it is produced by the exact same external evaluator, with the exact same generation config, on the exact same saved checkpoints. (GitHub)

What the final numbers suggest

On the full 541-prompt IFEval benchmark, the gap between 0.675 and 0.660 is only about 8 prompts, while the gap between 0.675 and 0.610 is about 35 prompts. That means your conclusion that “no replay is clearly worse” looks much firmer than your conclusion about the exact winner among 5%, 18%, and 50%. The 50% vs 5% gap may still be real, but it is small enough that pipeline details and random variance matter more. (arXiv)
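The prompt counts behind that statement, just to make the arithmetic explicit:

```python
n_prompts = 541
print(round((0.675 - 0.660) * n_prompts))  # ~8 prompts separate 50% from 5%/18% replay
print(round((0.675 - 0.610) * n_prompts))  # ~35 prompts separate 50% replay from no replay
```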

What I think is most likely true in your case

The most likely story is this:

  • 0% replay fits the new SFT data most aggressively and loses more instruction-following behavior, so it ends up worst on final IFEval. (Hugging Face)
  • 5% replay gives a small retention anchor, so it can look unusually strong during training while still not being the best final checkpoint. (Hugging Face)
  • 50% replay preserves the instruction-following behaviors that IFEval rewards more strongly, so it can finish best even if it is not the best under teacher-forced metrics. (Hugging Face)

That is why your results can look “reversed” without anything being broken.

What to run next

These are the checks most likely to resolve the puzzle cleanly.

1. Re-score every saved checkpoint with one external evaluator

For all replay ratios, evaluate the same saved checkpoints using one single external script with fixed:

  • tokenizer
  • chat template
  • decoding mode
  • max_new_tokens
  • EOS/stop handling
  • IFEval implementation

This removes the biggest source of pipeline mismatch. (GitHub)
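A minimal sketch of what that can look like, assuming one checkpoint directory per run and one scorer applied afterwards. The paths, run names, and prompt file are placeholders; you could also point an off-the-shelf harness such as lm-evaluation-harness at the same checkpoints instead.

```python
import json
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

RUNS = ["replay_0", "replay_5", "replay_18", "replay_50"]        # placeholder checkpoint dirs
gen_cfg = GenerationConfig.from_pretrained("ifeval_gen_config")  # the pinned config from above
prompts = [json.loads(l)["prompt"]
           for l in Path("ifeval_prompts.jsonl").read_text().splitlines()]

for run in RUNS:
    tok = AutoTokenizer.from_pretrained(run)
    model = AutoModelForCausalLM.from_pretrained(run, torch_dtype=torch.bfloat16, device_map="auto")
    responses = []
    for prompt in prompts:
        ids = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
        out = model.generate(ids, generation_config=gen_cfg)
        responses.append(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
    # Score `responses` with one IFEval implementation (the same code for every run),
    # and store both the responses and the per-prompt results for later diffing.
    Path(f"{run}_responses.json").write_text(json.dumps(responses))
```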

2. Report strict and loose together

Because the IFEval paper explicitly created a loose criterion to handle harmless formatting artifacts, reporting only strict can hide what kind of difference you are seeing. If a replay ratio improves strict and loose together, that is much stronger evidence of real behavior change than strict alone. (arXiv)

3. Use teacher-forced loss, not token accuracy, as the SFT-side diagnostic

If token accuracy is saturating or close to saturating, it stops being useful. Teacher-forced loss usually still contains signal after token accuracy flattens out. TRL exposes both because they are not interchangeable. (Hugging Face)
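If you want to check whether token accuracy has flattened while the loss is still moving, every saved checkpoint already contains the full log history. A sketch; the path and the exact metric keys (e.g. mean_token_accuracy) depend on your setup and TRL version, so inspect one entry first:

```python
import json
from pathlib import Path

# Every saved checkpoint contains trainer_state.json with the log history.
state = json.loads(Path("output/checkpoint-500/trainer_state.json").read_text())  # placeholder path

for entry in state["log_history"]:
    if "loss" in entry:  # training-step entries; eval entries carry "eval_..." keys instead
        print(
            entry.get("step"),
            entry.get("loss"),
            entry.get("mean_token_accuracy"),  # key name assumed; check your own logs
        )
```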

4. Save prompt-level pass/fail vectors

For each run, store which prompts passed and failed. If replay is mainly helping on things like bullet counts, P.S. endings, exact format, or length limits, that will show up immediately. That kind of prompt-level analysis is far more informative than comparing one average number. The IFEval benchmark is built precisely around these verifiable instruction types. (arXiv)
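A small sketch of that bookkeeping, assuming each run’s scorer can give you a per-prompt strict pass/fail boolean keyed by prompt id (all file names are placeholders):

```python
import json
from pathlib import Path

def load_passes(path):
    """JSON file mapping prompt_id -> bool (strict prompt-level pass)."""
    return json.loads(Path(path).read_text())

a = load_passes("replay_5_passes.json")
b = load_passes("replay_50_passes.json")

# Prompts that flipped between the two runs tell you *where* replay helps or hurts.
only_b = [pid for pid in a if b.get(pid) and not a[pid]]
only_a = [pid for pid in a if a[pid] and not b.get(pid)]
print(f"passes only with 50% replay: {len(only_b)}, only with 5% replay: {len(only_a)}")
```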

5. Verify your load path

If your post-training evaluation merges LoRA weights or reloads them in a different way than your training loop uses them, test the same prompt set with:

  • the live training object,
  • the reloaded adapter,
  • the merged model.

If those outputs differ, some of your reversal is likely load-path drift rather than true model ranking. (GitHub)
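A minimal way to compare two of those load paths on a fixed prompt (a sketch assuming a PEFT LoRA adapter saved at adapter_path; the model id and prompt are placeholders, and the third comparison point is whatever the live trainer.model produced on the same prompt during training):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

base_id, adapter_path = "your-base-model", "path/to/lora-adapter"   # placeholders
tok = AutoTokenizer.from_pretrained(base_id)
gen_cfg = GenerationConfig(max_new_tokens=256, do_sample=False)

def generate(model, prompt):
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(ids, generation_config=gen_cfg)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

prompt = "Answer in exactly two sentences: why is the sky blue?"

# 1) reloaded adapter on top of the base model
base = AutoModelForCausalLM.from_pretrained(base_id)
adapter_model = PeftModel.from_pretrained(base, adapter_path)
unmerged_out = generate(adapter_model, prompt)

# 2) the same adapter merged into the base weights
merged_model = adapter_model.merge_and_unload()
merged_out = generate(merged_model, prompt)

print(unmerged_out == merged_out)
```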

Bottom line

The clean conclusion is:

There is no contradiction between your results.
They are consistent with a model that gets better at imitating the labeled completion under teacher forcing when replay is low, but gets better at following explicit output constraints when replay is higher. Since IFEval scores the second behavior, not the first, replay can improve IFEval while hurting or barely helping token-level metrics. And since trainer-internal and external generation evaluation are not guaranteed to be equivalent, the ranking can differ between in-training and post-training evaluation. (Hugging Face)

The most defensible operational rule for your project is:

Use teacher-forced loss as a training diagnostic, but use one fixed external IFEval pipeline for model selection. (Hugging Face)