MLFlow not Logging Validation Curve

I have the following training arguments:

    training_args = TrainingArguments(
        output_dir="../model/lora",
        per_device_train_batch_size=1,
        learning_rate=5e-4,
        logging_steps=100,
        save_steps=2000,
        per_device_eval_batch_size=1,
        eval_strategy="steps",
        eval_steps=500,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        weight_decay=0.1,
        warmup_ratio=0.05,
        lr_scheduler_type="cosine",
        fp16=False,
        report_to=["mlflow"],
        run_name=timestamp,
    )

What I expected was something like the curves in the course (https://huggingface.co/learn/llm-course/chapter3/5), where you see both train and validation loss. But in reality I can only see the train loss in MLFlow, while the validation loss is just printed in the terminal.

Am I missing something? What could be wrong?


Several potential causes seem to exist.


From the screenshots you posted we know two important facts:

  • Evaluation really runs: the terminal shows a log dict with 'eval_loss': 2.22..., 'eval_runtime': ....
  • Training is logged to MLflow: you see a smooth loss curve in the MLflow UI.

So the model does compute validation loss, but MLflow only shows the training metric you’re looking at.

Below is a detailed breakdown of how this logging pipeline works, what typically goes wrong in situations exactly like this, and what to do in each case.


1. How Hugging Face → MLflow logging actually works

  1. You create TrainingArguments(..., report_to=["mlflow"]).

  2. Transformers attaches an MLflowCallback to the trainer. This callback’s on_log method receives every log dict that Trainer emits and calls mlflow.log_metrics on all numeric entries. (mlflow.org)

  3. During training you get log dicts like:

    {'loss': 2.18, 'grad_norm': 0.26, 'learning_rate': 8.9e-5, 'epoch': 0.74}
    
  4. During evaluation (triggered by eval_strategy="steps" + eval_steps=500) you get log dicts like:

    {'eval_loss': 2.22, 'eval_runtime': 2145.9,
     'eval_samples_per_second': 11.98, 'eval_steps_per_second': 11.98,
     'epoch': 0.72}
    

    Your terminal screenshot shows exactly such a dict, so step (4) is definitely happening.

  5. MLflow stores each metric name as its own time series keyed by step. So:

    • loss → training curve
    • eval_loss → validation curve

This is the same mechanism used in MLflow’s own Transformers fine-tuning tutorial; their logs also contain both loss and eval_loss and both appear as separate metrics in the MLflow UI. (mlflow.org)
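
If you'd rather verify this programmatically than click through the UI, you can ask the tracking server which metric series a run actually recorded. A minimal sketch, assuming the default tracking URI and that you copy the run ID from the MLflow UI:

from mlflow.tracking import MlflowClient

client = MlflowClient()                       # uses the default/active tracking URI
run = client.get_run("<run-id from the MLflow UI>")

# run.data.metrics maps each metric name to its latest value;
# both "loss" and "eval_loss" should appear here if logging worked.
print(sorted(run.data.metrics.keys()))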


2. Most common, very simple cause

2.1 You are only plotting loss, not eval_loss

In MLflow, each metric has its own chart:

  • The chart labeled loss shows only training loss.
  • The chart labeled eval_loss shows only evaluation loss.
  • MLflow does not automatically overlay eval_loss on the loss plot.

This is different from the course screenshots, which use Weights & Biases; W&B makes it easy to overlay multiple metrics on the same plot by default. MLflow’s default “single metric per chart” UI leads to exactly the confusion you’re describing.

What to check

  1. Open the run in MLflow.
  2. Go to the Metrics tab.
  3. In the metric list, look for a metric called eval_loss in addition to loss, grad_norm, learning_rate, etc.
  4. Click on eval_loss. That chart should show your sparse validation points (one every eval_steps).

If eval_loss is in that list, then nothing is broken:

  • You do have validation logging.
  • You just need to select the eval_loss metric (or compare metrics in a separate view) instead of only looking at the loss chart.
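
If you want both curves on one chart, like in the course screenshots, you can also pull the two metric histories out of MLflow and overlay them yourself. A minimal sketch, assuming matplotlib is installed and you have the run ID at hand:

import matplotlib.pyplot as plt
from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = "<run-id from the MLflow UI>"

train = client.get_metric_history(run_id, "loss")       # one point per logging_steps
val = client.get_metric_history(run_id, "eval_loss")    # one point per eval_steps

plt.plot([m.step for m in train], [m.value for m in train], label="train loss")
plt.plot([m.step for m in val], [m.value for m in val], label="eval loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()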

3. If eval_loss is truly missing as a metric

If you do not see eval_loss in MLflow’s metric list, but you do see it printed in the terminal, then:

  • Transformers is computing and logging evaluation internally.
  • Those eval logs are not being forwarded to MLflow as metrics.

There are a few realistic causes for that.

3.1 MLflow callback not active or partially disabled

HF logging to MLflow only works if MLflowCallback is attached. (mlflow.org)

This can fail if:

  • You override callbacks when constructing the trainer and forget to add MLflowCallback.
  • The environment variable DISABLE_MLFLOW_INTEGRATION is set, which disables the built-in callback.
  • You’re using a wrapper (SFTTrainer, a custom trainer) that replaces the callback handler.

How to confirm

After building the trainer:

print(trainer.callback_handler.callbacks)

You should see something like <transformers.integrations.MLflowCallback object at ...> in that list.

If you don’t:

from transformers.integrations import MLflowCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[MLflowCallback()],  # plus any other callbacks you want
)

Now both training and eval logs should be forwarded to MLflow.
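
If you'd rather check and repair the wiring after constructing the trainer instead of rebuilding it, here is a small sketch of both checks, assuming trainer is the Trainer you already built:

import os
from transformers.integrations import MLflowCallback

# The built-in integration is skipped entirely when this variable is set to a truthy value.
print("DISABLE_MLFLOW_INTEGRATION =", os.environ.get("DISABLE_MLFLOW_INTEGRATION"))

# Attach the callback after construction if it is missing.
if not any(isinstance(cb, MLflowCallback) for cb in trainer.callback_handler.callbacks):
    trainer.add_callback(MLflowCallback())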

3.2 You are using TRL SFTTrainer + PEFT/LoRA and the trainer was mis-configured

For LoRA + TRL SFTTrainer there have been several issues where evaluation loss was not computed at all, due to trainer.can_return_loss being set to False when the model’s forward did not accept a return_loss argument. (GitHub)

In those cases:

  • No eval_loss appears in the evaluation logs (terminal or log_history).
  • MLflow cannot log what doesn’t exist.

Workaround from those issues:

trainer = SFTTrainer(...)

# After trainer is created, before trainer.train():
trainer.can_return_loss = True

Along with, for some LLMs, explicitly setting label_names in TrainingArguments (often to an empty list for causal LMs): (Hugging Face Forums)

training_args = TrainingArguments(
    ...,
    label_names=[],  # for causal LM style SFT
)

However, your terminal screenshot already shows 'eval_loss': 2.22..., so your trainer is computing eval loss. That means this particular “no eval_loss at all” bug is probably not what you’re seeing, but it’s a common gotcha in similar LoRA/SFT setups.

3.3 Evaluation metrics never reach the callback, but exist in log_history

All logs that the Trainer emits are stored in trainer.state.log_history. That list will contain dicts for both training and evaluation. (Hugging Face Forums)

After training:

for entry in trainer.state.log_history:
    if "eval_loss" in entry:
        print(entry)
        break

Cases:

  • If you see entries with eval_loss, then Trainer is logging evaluation correctly.

    • If MLflow still has no eval_loss metric, the MLflow callback is not firing for eval logs → fix callback configuration as in 3.1.
  • If you don’t see any eval_loss entries in log_history, then evaluation loss is never actually passed into Trainer.log; your terminal log might be coming from a custom callback or custom print. MLflow cannot see these, so you need to either:

    • Fix the trainer so it goes through Trainer.log, or
    • Log eval loss to MLflow manually (next section).
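
To tell these cases apart in one pass, here is a small diagnostic sketch, assuming trainer is your (already trained) Trainer instance:

from transformers.integrations import MLflowCallback

has_eval = any("eval_loss" in entry for entry in trainer.state.log_history)
has_cb = any(isinstance(cb, MLflowCallback) for cb in trainer.callback_handler.callbacks)

if has_eval and has_cb:
    print("eval_loss is logged and MLflowCallback is attached; check the eval_loss chart in MLflow.")
elif has_eval:
    print("eval_loss is logged but MLflowCallback is missing; fix the callback wiring (section 3.1).")
else:
    print("eval_loss never reaches Trainer.log; log it manually from log_history (section 4).")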

4. Robust: explicitly log eval loss from log_history

Even if you never figure out which callback is mis-configured, you can always recover both curves from trainer.state.log_history and push them into MLflow yourself. This pattern is also what many people use when working around logging bugs. (Hugging Face Forums)

Example:

import mlflow

# ... build trainer & train ...
trainer.train()

for i, entry in enumerate(trainer.state.log_history):
    step = entry.get("step", entry.get("global_step", i))

    if "loss" in entry:
        mlflow.log_metric("loss", entry["loss"], step=step)

    if "eval_loss" in entry:
        mlflow.log_metric("eval_loss", entry["eval_loss"], step=step)

Then in MLflow you will definitely have both loss and eval_loss metrics, regardless of what the built-in callback did.
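
One caveat with this pattern: the built-in callback may already have ended its MLflow run by the time trainer.train() returns, in which case bare mlflow.log_metric calls start a brand-new run. A hedged sketch of logging into the original run instead, assuming the Trainer's run was the last one this process touched:

import mlflow
from mlflow.tracking import MlflowClient

# Reuse the run the Trainer logged to, rather than letting mlflow create a new one.
run_id = mlflow.last_active_run().info.run_id
client = MlflowClient()

for i, entry in enumerate(trainer.state.log_history):
    step = entry.get("step", i)
    if "eval_loss" in entry:
        client.log_metric(run_id, "eval_loss", entry["eval_loss"], step=step)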


5. Sanity checks specific to your arguments

Your TrainingArguments:

training_args = TrainingArguments(
    output_dir="../model/lora",
    per_device_train_batch_size=1,
    learning_rate=5e-4,
    logging_steps=100,
    save_steps=2000,
    per_device_eval_batch_size=1,
    eval_strategy="steps",        # correct name on recent transformers
    eval_steps=500,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    fp16=False,
    report_to=["mlflow"],
    run_name=timestamp,
)

Notes:

  • On new Transformers versions, evaluation_strategy was renamed to eval_strategy; your usage is consistent with the new API. (GitHub)
  • eval_strategy="steps" + eval_steps=500 will produce one eval log roughly every 500 optimization steps (i.e., after gradient accumulation).
  • logging_steps=100 controls how often training loss is logged. It doesn’t affect the frequency of eval logs.
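
As a quick back-of-the-envelope check on those frequencies, assuming a single GPU so the effective batch size is just per_device_train_batch_size * gradient_accumulation_steps:

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
examples_per_optimizer_step = per_device_train_batch_size * gradient_accumulation_steps  # 8

print("examples between train-loss points:", 100 * examples_per_optimizer_step)  # 800
print("examples between eval points:      ", 500 * examples_per_optimizer_step)  # 4000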

Given that:

  1. You see {'eval_loss': ...} in the terminal → evaluation is running and producing a metric.
  2. You see a smooth loss curve in MLflow → Trainer logs are reaching MLflow.
  3. The remaining question is visibility (are you clicking eval_loss?) or callback wiring (is MLflowCallback actually attached?).

6. Summary of causes and fixes

Cause A – eval_loss is there but you are only plotting loss

Symptom: eval_loss appears in the MLflow metric list, but you only opened the loss chart.

Fix:
Open the Metrics tab → click the eval_loss metric → optionally compare metrics or download data and plot both curves together.


Cause B – MLflow callback not attached or disabled

Symptom: loss is logged in MLflow, but no eval_loss metric; trainer.state.log_history contains eval entries.

Fix:

from transformers.integrations import MLflowCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[MLflowCallback()],
)

And ensure DISABLE_MLFLOW_INTEGRATION is not set in the environment.


Cause C – Trainer never passes eval loss into the logging pipeline

Symptom: Terminal prints something custom; trainer.state.log_history has no eval_loss entries.

Fix options:

  1. If you’re using TRL SFTTrainer + PEFT/LoRA and running into a known bug where eval loss is not computed, set:

    training_args = TrainingArguments(..., label_names=[])
    trainer.can_return_loss = True
    

    (GitHub)

  2. As a robust fallback, log from log_history manually with mlflow.log_metric("eval_loss", ...) as shown above.


If you walk through:

  1. “Is there an eval_loss metric in MLflow’s Metrics tab?”
  2. “Does trainer.callback_handler.callbacks include MLflowCallback?”
  3. “Does trainer.state.log_history actually contain eval_loss entries?”

you will be able to pinpoint which of these causes is affecting your run and apply the corresponding fix.

I went through @John6666’s reply which led me to the official MLFlow example.

I reran the example, and it did not matter whether I added report_to=["mlflow"] or the recommended with mlflow.start_run() as run:. Either way, the example logged both validation and training loss.

So I changed one thing in the Trainer initialisation: I commented out compute_metrics:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
#    compute_metrics=compute_metrics,
)

This stopped the evaluation loss from being reported to MLFlow, which seems like strange behaviour to me. Is that a bug or the intended use? Should I open a ticket on GitHub?


That’s nonsensical behavior. If it reproduces every time, it’s highly likely to be a strange bug. (Since this was checked against the GitHub version of the code, it’s also not impossible that the bug has already been fixed…)


Short answer
It is not intended that you must define compute_metrics for eval_loss to be logged to MLflow.
If you can really reproduce “eval_loss disappears from MLflow as soon as I remove compute_metrics”, that is most likely a bug (or at least an undocumented edge-case) and it is reasonable to open a GitHub issue – but only after a couple of sanity checks.

Below is the reasoning, grounded in how the Trainer and the MLflow integration are implemented.


1. What the Trainer is supposed to do

1.1 Evaluation does not depend on compute_metrics

In Trainer.evaluate / Trainer.evaluation_loop the flow is (simplified):

  1. Run evaluation loop, accumulate per-batch losses.
  2. Compute the mean loss eval_loss.
  3. Build a metrics dict.
  4. If self.compute_metrics is not None, call it and merge its outputs into metrics.
  5. Unconditionally add eval_loss to metrics.
  6. Call self.log(metrics) so every callback, including the MLflow one, receives those metrics through on_log. (MLflow)

Key point: the line that inserts eval_loss into metrics is outside of the if self.compute_metrics is not None block. So eval_loss should be present whether or not you pass a compute_metrics function.
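
Here is a toy, runnable paraphrase of that control flow (not the real Trainer source, just the shape of the logic):

def build_eval_metrics(mean_loss, compute_metrics=None, preds=None, labels=None):
    metrics = {}
    if compute_metrics is not None:          # optional extras: accuracy, F1, ...
        metrics.update(compute_metrics(preds, labels))
    metrics["eval_loss"] = mean_loss         # added unconditionally
    return metrics

print(build_eval_metrics(2.22))                                        # {'eval_loss': 2.22}
print(build_eval_metrics(2.22, lambda p, l: {"eval_accuracy": 0.81}))  # both keys present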

So, by design:

  • compute_metrics is only for additional metrics (accuracy, F1, etc.).
  • eval_loss is always computed and always passed into the logging pipeline, independent of compute_metrics.

1.2 Where those metrics go

After the evaluation loop, Trainer.log is called with a dict that contains:

{
    "eval_loss": <float>,
    "eval_runtime": ...,
    "eval_samples_per_second": ...,
    "eval_steps_per_second": ...,
    "epoch": ...
    # plus anything returned by compute_metrics if present
}

That same dict is:

  • Printed to the terminal.
  • Appended to trainer.state.log_history.
  • Broadcast to all callbacks via callback_handler.on_log.

So if you see {'eval_loss': ...} in the terminal, the Trainer is producing and emitting eval_loss.


2. What the MLflow integration is supposed to do

The built-in MLflow integration lives in transformers.integrations.MLflowCallback. Its on_log method is very simple: (MLflow)

def on_log(self, args, state, control, logs=None, model=None, **kwargs):
    if not self._initialized:
        self.setup(args, state, model)

    if state.is_world_process_zero:
        import mlflow
        # Roughly: keep the numeric entries in `logs` and log them all at the current step
        metrics = {k: v for k, v in logs.items() if isinstance(v, (int, float))}
        mlflow.log_metrics(metrics, step=state.global_step)

Important properties:

  • It does not look at compute_metrics.
  • It does not special-case eval_loss vs loss.
  • It just logs every numeric key in logs as an MLflow metric at the current global_step.

Therefore, if:

  • the callback is attached, and
  • logs contains eval_loss,

then MLflow will get an eval_loss time series.

So there is no intentional “if compute_metrics is missing, don’t log eval_loss” logic anywhere in the Hugging Face → MLflow pipeline.


3. Why might you observe a dependency on compute_metrics?

Given the code paths, there are only a few realistic explanations for:

“With compute_metrics defined, I see eval_loss in MLflow. When I comment it out, it vanishes.”

3.1 Misinterpretation in the MLflow UI

Common pattern:

  • With compute_metrics, you see both eval_loss and, say, eval_accuracy in the metric list and you click around.
  • Without compute_metrics, the only eval metric is eval_loss. If you only look at the chart for loss (training) you may think val is “gone”, even though eval_loss exists as a separate metric.

Sanity check:

# After training (run_id is the run ID shown in the MLflow UI)
from mlflow.tracking import MlflowClient
client = MlflowClient()
print(client.get_metric_history(run_id, "eval_loss"))

or simply:

  • Open the run in MLflow
  • Go to Metrics
  • Check whether eval_loss is listed as a metric, even when compute_metrics is commented.

If eval_loss is in that list, then the behavior is actually correct, just a UI perception issue.

3.2 Callback wiring differences between your two experiments

You mentioned that you:

  • Ran the official MLflow tutorial notebook (which uses Trainer, compute_metrics, and with mlflow.start_run(): trainer.train()), and
  • Modified it by only commenting out compute_metrics.

Subtle but important possibilities:

  • In one version, MLflowCallback is attached (via report_to="mlflow" or default behavior), and in the other it is not (e.g. report_to changed, environment variable DISABLE_MLFLOW_INTEGRATION, or a custom callbacks list).
  • The MLflow tutorial itself relies on the Trainer’s callback, not on a special “Transformer autolog” API, so if the callback is missing you’ll still see terminal logs but no metrics in MLflow.

Checks:

print(trainer.callback_handler.callbacks)

You should see an instance of transformers.integrations.MLflowCallback in that list, in both the “with compute_metrics” and “without” setups.

If it disappears only when you comment compute_metrics, that would be a real bug.

3.3 Evaluation logs never reach Trainer.log in your modified setup

Less likely here (because you say the terminal still shows eval_loss), but in general:

  • Some custom trainers or wrappers print evaluation results directly instead of going through Trainer.log.
  • In that case, MLflow never sees those numbers.

The robust way to check is to inspect trainer.state.log_history:

for entry in trainer.state.log_history:
    if "eval_loss" in entry:
        print(entry)
        break

If those entries exist but MLflow has no eval_loss metric, the bug is between on_log and MLflow.
If those entries don’t exist, the evaluation loss is never entering the logging pipeline.


4. Is this intended behavior?

Based on the library design and current code:

  • compute_metrics is optional.
  • eval_loss should be computed, logged by Trainer, and forwarded to MLflow regardless of whether compute_metrics is provided.
  • The MLflow callback does not depend on compute_metrics.

So: the behavior you describe (“eval loss only appears in MLflow if I define compute_metrics”) is not the intended semantics.

It’s either:

  • A misinterpretation of the MLflow UI (looking only at loss, not at eval_loss), or
  • A genuine bug/edge-case in the interaction between your versions of transformers, mlflow, and how callbacks are wired.

5. Should you open a GitHub issue?

Yes, but only after a minimal set of checks so that the issue is actionable.

Before opening:

  1. Confirm that in the “no compute_metrics” run:

    • trainer.state.log_history contains at least one dict with "eval_loss" in it.
    • trainer.callback_handler.callbacks includes MLflowCallback.
    • MLflow’s Metrics tab truly has no eval_loss metric, not even as a separate entry.
  2. Note your versions (see the snippet after this list):

    • transformers.__version__
    • mlflow.__version__
    • Any wrapper trainers (TRL SFTTrainer, PEFT, etc.) if present.
  3. Prepare a minimal script:

    • Start from the official MLflow tutorial code.
    • Show version A: with compute_metrics → eval_loss appears in MLflow.
    • Show version B: same code but with compute_metrics=None → eval_loss missing, while terminal and log_history still include it.
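
For step 2, a small snippet that collects those versions using only the standard library:

import importlib.metadata as md

for pkg in ("transformers", "mlflow", "trl", "peft", "accelerate"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")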

If, after these checks, the difference is still 100% tied to compute_metrics, it is appropriate to open an issue on the Transformers repo (because the MLflow callback lives there). In the issue, explicitly state that:

  • trainer.state.log_history contains eval_loss,
  • but MLflowCallback does not log it unless compute_metrics is defined.

That will give the maintainers enough detail to reproduce and decide whether it’s a bug in the callback, in the Trainer, or in the MLflow tutorial wiring.

No idea what went wrong. In the end I decided to delete my uv cache, venv, and lock file, and reinstall everything, and it worked afterwards. So I still don't know what was wrong, but it works now.

