MLFlow not Logging Validation Curve

That’s puzzling behavior. If it reproduces every time, it’s very likely a bug. (The reasoning below was checked against the GitHub version of the code, so it’s also possible this is a bug that has already been fixed in a newer release…)


Short answer
You should not have to define compute_metrics for eval_loss to be logged to MLflow.
If you can really reproduce “eval_loss disappears from MLflow as soon as I remove compute_metrics”, that is most likely a bug (or at least an undocumented edge case), and it is reasonable to open a GitHub issue – but only after a couple of sanity checks.

Below is the reasoning, grounded in how the Trainer and the MLflow integration are implemented.


1. What the Trainer is supposed to do

1.1 Evaluation does not depend on compute_metrics

In Trainer.evaluate / Trainer.evaluation_loop the flow is (simplified):

  1. Run evaluation loop, accumulate per-batch losses.
  2. Compute the mean loss eval_loss.
  3. Build a metrics dict.
  4. If self.compute_metrics is not None, call it and merge its outputs into metrics.
  5. Unconditionally add eval_loss to metrics.
  6. Call self.log(metrics), which hands the metrics dict to every attached callback (including the MLflow one) via on_log.

Key point: the line that inserts eval_loss into metrics is outside of the if self.compute_metrics is not None block. So eval_loss should be present whether or not you pass a compute_metrics function.

So, by design:

  • compute_metrics is only for additional metrics (accuracy, F1, etc.).
  • eval_loss is always computed and always passed into the logging pipeline, independent of compute_metrics.
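
If you want to verify this without reading the source, call evaluate() directly on a Trainer that was built without compute_metrics (a minimal check; it assumes a trainer object wired up as in the tutorial):

metrics = trainer.evaluate()
print(metrics)                 # e.g. {'eval_loss': 0.69, 'eval_runtime': ..., 'epoch': ...}
assert "eval_loss" in metrics  # holds with or without compute_metrics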

1.2 Where those metrics go

At the end of evaluation, Trainer.log(logs) is called with a dict that contains:

{
    "eval_loss": <float>,
    "eval_runtime": ...,
    "eval_samples_per_second": ...,
    "eval_steps_per_second": ...,
    "epoch": ...
    # plus anything returned by compute_metrics if present
}

That same dict is:

  • Printed to the terminal.
  • Appended to trainer.state.log_history.
  • Broadcast to all callbacks via callback_handler.on_log.

So if you see {'eval_loss': ...} in the terminal, the Trainer is producing and emitting eval_loss.
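
If you want to watch that broadcast directly, a tiny throwaway callback makes it visible (a minimal sketch; PrintLogsCallback is just a name used here, and callbacks=[...] is the standard way to attach extra callbacks to a Trainer):

from transformers import TrainerCallback

class PrintLogsCallback(TrainerCallback):
    """Print every logs dict the Trainer broadcasts, so eval_loss (or its absence) is obvious."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        print(f"on_log @ step {state.global_step}: {logs}")

# e.g. trainer = Trainer(model=model, args=args, ..., callbacks=[PrintLogsCallback()])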


2. What the MLflow integration is supposed to do

The built-in MLflow integration lives in transformers.integrations.MLflowCallback. Its on_log method is very simple:

def on_log(self, args, state, control, logs=None, model=None, **kwargs):
    if not self._initialized:
        self.setup(args, state, model)

    if state.is_world_process_zero:
        # Roughly: keep only numeric values, then hand them to MLflow
        metrics = {k: v for k, v in logs.items() if isinstance(v, (int, float))}
        self._ml_flow.log_metrics(metrics=metrics, step=state.global_step)

Important properties:

  • It does not look at compute_metrics.
  • It does not special-case eval_loss vs loss.
  • It just logs every numeric key in logs as an MLflow metric at the current global_step.

Therefore, if:

  • the callback is attached, and
  • logs contains eval_loss,

then MLflow will get an eval_loss time series.

So there is no intentional “if compute_metrics is missing, don’t log eval_loss” logic anywhere in the Hugging Face → MLflow pipeline.
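
A quick end-to-end check, independent of the UI, is to ask MLflow what it actually stored for the run (a sketch; run_id is assumed to be the ID of the run the Trainer logged into):

import mlflow

run = mlflow.get_run(run_id)      # run_id of the training run
print(sorted(run.data.metrics))   # names of every logged metric (latest values)
# "eval_loss" should appear here whether or not compute_metrics was defined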


3. Why might you observe a dependency on compute_metrics?

Given the code paths, there are only a few realistic explanations for:

“With compute_metrics defined, I see eval_loss in MLflow. When I comment it out, it vanishes.”

3.1 Misinterpretation in the MLflow UI

Common pattern:

  • With compute_metrics, you see both eval_loss and, say, eval_accuracy in the metric list and you click around.
  • Without compute_metrics, the only eval metric is eval_loss. If you only look at the chart for loss (the training loss), you may think the validation curve is “gone”, even though eval_loss still exists as a separate metric.

Sanity check:

# After training; run_id is the ID of the MLflow run the Trainer logged into
from mlflow.tracking import MlflowClient
client = MlflowClient()
print(client.get_metric_history(run_id, "eval_loss"))

or simply:

  • Open the run in MLflow
  • Go to Metrics
  • Check whether eval_loss is listed as a metric, even when compute_metrics is commented out.

If eval_loss is in that list, then the behavior is actually correct, just a UI perception issue.

3.2 Callback wiring differences between your two experiments

You mentioned that you:

  • Ran the official MLflow tutorial notebook (which uses Trainer, compute_metrics, and with mlflow.start_run(): trainer.train()), and
  • Modified it by only commenting out compute_metrics.

Subtle but important possibilities:

  • In one version, MLflowCallback is attached (via report_to="mlflow" or default behavior), and in the other it is not (e.g. report_to changed, environment variable DISABLE_MLFLOW_INTEGRATION, or a custom callbacks list).
  • The MLflow tutorial itself relies on the Trainer’s callback, not on a special “Transformer autolog” API, so if the callback is missing you’ll still see terminal logs but no metrics in MLflow.

Checks:

print(trainer.callback_handler.callbacks)

You should see an instance of transformers.integrations.MLflowCallback in that list, in both the “with compute_metrics” and “without” setups.

If it disappears only when you comment compute_metrics, that would be a real bug.
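
To take the default wiring out of the equation, you can also request the MLflow integration explicitly in both experiments (a sketch; report_to is a standard TrainingArguments parameter, the remaining arguments stay as in the tutorial):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    report_to=["mlflow"],   # explicitly request MLflowCallback instead of relying on defaults
    # ... keep the tutorial’s evaluation/logging settings unchanged
)

With the same args in both runs, any difference in trainer.callback_handler.callbacks can no longer come from report_to defaults.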

3.3 Evaluation logs never reach Trainer.log in your modified setup

Less likely here (because you say the terminal still shows eval_loss), but in general:

  • Some custom trainers or wrappers print evaluation results directly instead of going through Trainer.log.
  • In that case, MLflow never sees those numbers.

The robust way to check is to inspect trainer.state.log_history:

for entry in trainer.state.log_history:
    if "eval_loss" in entry:
        print(entry)
        break

If those entries exist but MLflow has no eval_loss metric, the bug is between on_log and MLflow.
If those entries don’t exist, the evaluation loss is never entering the logging pipeline.


4. Is this intended behavior?

Based on the library design and current code:

  • compute_metrics is optional.
  • eval_loss should be computed, logged by Trainer, and forwarded to MLflow regardless of whether compute_metrics is provided.
  • The MLflow callback does not depend on compute_metrics.

So: the behavior you describe (“eval loss only appears in MLflow if I define compute_metrics”) is not the intended semantics.

It’s either:

  • A misinterpretation of the MLflow UI (looking only at loss, not at eval_loss), or
  • A genuine bug/edge-case in the interaction between your versions of transformers, mlflow, and how callbacks are wired.

5. Should you open a GitHub issue?

Yes, but only after a minimal set of checks so that the issue is actionable.

Before opening:

  1. Confirm that in the “no compute_metrics” run:

    • trainer.state.log_history contains at least one dict with "eval_loss" in it.
    • trainer.callback_handler.callbacks includes MLflowCallback.
    • MLflow’s Metrics tab truly has no eval_loss metric, not even as a separate entry.
  2. Note your versions:

    • transformers.__version__
    • mlflow.__version__
    • Any wrapper trainers (TRL SFTTrainer, PEFT, etc.) if present.
  3. Prepare a minimal script (a sketch is given after this list):

    • Start from the official MLflow tutorial code.
    • Show version A: with compute_metrics → eval_loss appears in MLflow.
    • Show version B: same code but with compute_metrics=None → eval_loss missing, while terminal and log_history still include it.
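
A compressed sketch of such a script is below. The model and dataset names are placeholders (use whatever the MLflow tutorial you followed actually loads), and eval_strategy may be spelled evaluation_strategy on older transformers releases:

import mlflow
import transformers
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

print("transformers", transformers.__version__, "| mlflow", mlflow.__version__)

model_name = "distilbert-base-uncased"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

raw = load_dataset("imdb", split="train[:200]").train_test_split(test_size=0.2)
ds = raw.map(lambda b: tokenizer(b["text"], truncation=True,
                                 padding="max_length", max_length=128), batched=True)

args = TrainingArguments(
    output_dir="mlflow-repro",
    report_to=["mlflow"],
    eval_strategy="epoch",               # evaluation_strategy="epoch" on older versions
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

# Version B of the experiment: no compute_metrics at all
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])

with mlflow.start_run() as run:
    trainer.train()

# Both of these should contain eval_loss; if only the first does, that is the bug
print([e for e in trainer.state.log_history if "eval_loss" in e])
print(mlflow.get_run(run.info.run_id).data.metrics)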

If, after these checks, the difference is still 100% tied to compute_metrics, it is appropriate to open an issue on the Transformers repo (because the MLflow callback lives there). In the issue, explicitly state that:

  • trainer.state.log_history contains eval_loss,
  • but MLflowCallback does not log it unless compute_metrics is defined.

That will give the maintainers enough detail to reproduce and decide whether it’s a bug in the callback, in the Trainer, or in the MLflow tutorial wiring.