That behavior doesn’t make sense on its face. If it reproduces every time, it is most likely a genuine bug. (The reasoning below was checked against the code on GitHub, so it is also possible this is a bug that has already been fixed there…)
Short answer
It is not intended that you must define compute_metrics for eval_loss to be logged to MLflow.
If you can really reproduce “eval_loss disappears from MLflow as soon as I remove compute_metrics”, that is most likely a bug (or at least an undocumented edge-case) and it is reasonable to open a GitHub issue – but only after a couple of sanity checks.
Below is the reasoning, grounded in how the Trainer and the MLflow integration are implemented.
1. What the Trainer is supposed to do
1.1 Evaluation does not depend on compute_metrics
In Trainer.evaluate / Trainer.evaluation_loop the flow is (simplified):
- Run evaluation loop, accumulate per-batch losses.
- Compute the mean loss eval_loss.
- Build a metrics dict.
- If self.compute_metrics is not None, call it and merge its outputs into metrics.
- Unconditionally add eval_loss to metrics.
- Call self.log_metrics("eval", metrics) and self.log, so callbacks see those metrics.
Key point: the line that inserts eval_loss into metrics is outside of the if self.compute_metrics is not None block. So eval_loss should be present whether or not you pass a compute_metrics function.
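As a rough, self-contained illustration of that control flow (my own sketch, not the actual transformers source; names are simplified):

def evaluate_sketch(batch_losses, compute_metrics=None, eval_pred=None):
    # Simplified stand-in for Trainer.evaluate / evaluation_loop
    metrics = {}
    if compute_metrics is not None:                  # optional: extra metrics only
        metrics.update(compute_metrics(eval_pred))
    metrics["eval_loss"] = sum(batch_losses) / len(batch_losses)  # always added
    return metrics                                   # this dict is what gets logged

print(evaluate_sketch([1.0, 0.5]))  # {'eval_loss': 0.75} even without compute_metrics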
So, by design:
- compute_metrics is only for additional metrics (accuracy, F1, etc.).
- eval_loss is always computed and always passed into the logging pipeline, independent of compute_metrics.
1.2 Where those metrics go
Trainer.log_metrics("eval", metrics) eventually calls Trainer.log(logs) with a dict that contains:
{
"eval_loss": <float>,
"eval_runtime": ...,
"eval_samples_per_second": ...,
"eval_steps_per_second": ...,
"epoch": ...
# plus anything returned by compute_metrics if present
}
That same dict is:
- Printed to the terminal.
- Appended to trainer.state.log_history.
- Broadcast to all callbacks via callback_handler.on_log.
So if you see {'eval_loss': ...} in the terminal, the Trainer is producing and emitting eval_loss.
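If you want to see exactly what the callbacks receive, a small debug callback (my own sketch, not part of transformers) can be attached to the Trainer:

from transformers import TrainerCallback

class PrintLogsCallback(TrainerCallback):
    # Debug-only: print every dict that reaches the callbacks via on_log.
    def on_log(self, args, state, control, logs=None, **kwargs):
        print(f"step {state.global_step}: {logs}")

# trainer.add_callback(PrintLogsCallback())  # eval_loss should show up here too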
2. What the MLflow integration is supposed to do
The built-in MLflow integration lives in transformers.integrations.MLflowCallback. Its on_log method is very simple:
def on_log(self, args, state, control, logs=None, model=None, **kwargs):
    if not self._initialized:
        self.setup(args, state, model)
    if state.is_world_process_zero:
        import mlflow
        # Roughly: log every numeric value in logs as an MLflow metric
        metrics = {k: v for k, v in logs.items() if isinstance(v, (int, float))}
        mlflow.log_metrics(metrics, step=state.global_step)
Important properties:
- It does not look at compute_metrics.
- It does not special-case eval_loss vs loss.
- It just logs every numeric key in logs as an MLflow metric at the current global_step.
Therefore, if:
- the callback is attached, and
- logs contains eval_loss,
then MLflow will get an eval_loss time series.
So there is no intentional “if compute_metrics is missing, don’t log eval_loss” logic anywhere in the Hugging Face → MLflow pipeline.
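As a sanity check that nothing else is required, this tiny standalone snippet (my own example, independent of the Trainer) logs an eval_loss metric the same way the callback effectively does:

import mlflow

with mlflow.start_run():
    logs = {"eval_loss": 0.42, "eval_runtime": 1.3, "epoch": 1.0}  # example values
    mlflow.log_metrics(logs, step=10)  # no compute_metrics involved anywhere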
3. Why might you observe a dependency on compute_metrics?
Given the code paths, there are only a few realistic explanations for:
“With compute_metrics defined, I see eval_loss in MLflow. When I comment it out, it vanishes.”
3.1 Misinterpretation in the MLflow UI
Common pattern:
- With compute_metrics, you see both eval_loss and, say, eval_accuracy in the metric list and you click around.
- Without compute_metrics, the only eval metric is eval_loss. If you only look at the chart for loss (the training loss), you may think the validation loss is “gone”, even though eval_loss exists as a separate metric.
Sanity check:
import mlflow
client = mlflow.tracking.MlflowClient()
run_id = mlflow.last_active_run().info.run_id   # after training; or copy the run ID from the MLflow UI
print(client.get_metric_history(run_id, "eval_loss"))
or simply:
- Open the run in MLflow
- Go to Metrics
- Check whether eval_loss is listed as a metric, even when compute_metrics is commented out.
If eval_loss is in that list, then the behavior is actually correct, just a UI perception issue.
3.2 Callback wiring differences between your two experiments
You mentioned that you:
- Ran the official MLflow tutorial notebook (which uses Trainer, compute_metrics, and with mlflow.start_run(): trainer.train()), and
- Modified it by only commenting out compute_metrics.
Subtle but important possibilities:
- In one version, MLflowCallback is attached (via report_to="mlflow" or the default behavior), and in the other it is not (e.g. report_to changed, the DISABLE_MLFLOW_INTEGRATION environment variable set, or a custom callbacks list passed).
- The MLflow tutorial itself relies on the Trainer’s callback, not on a special “Transformers autolog” API, so if the callback is missing you’ll still see terminal logs but no metrics in MLflow.
Checks:
print(trainer.callback_handler.callbacks)
You should see an instance of transformers.integrations.MLflowCallback in that list, in both the “with compute_metrics” and “without” setups.
If it disappears only when you comment compute_metrics, that would be a real bug.
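If the callback turns out to be missing, one way to make its presence explicit (a sketch; output_dir and the other arguments are placeholders) is to request the MLflow integration by name instead of relying on the default report_to behavior:

from transformers import TrainingArguments
from transformers.integrations import MLflowCallback

args = TrainingArguments(
    output_dir="out",        # placeholder
    report_to=["mlflow"],    # attach only the MLflow integration, explicitly
)

# or, on an already-constructed trainer:
# trainer.add_callback(MLflowCallback)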
3.3 Evaluation logs never reach Trainer.log in your modified setup
Less likely here (because you say the terminal still shows eval_loss), but in general:
- Some custom trainers or wrappers print evaluation results directly instead of going through Trainer.log.
- In that case, MLflow never sees those numbers.
The robust way to check is to inspect trainer.state.log_history:
for entry in trainer.state.log_history:
if "eval_loss" in entry:
print(entry)
break
If those entries exist but MLflow has no eval_loss metric, the bug is between on_log and MLflow.
If those entries don’t exist, the evaluation loss is never entering the logging pipeline.
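To decide which of those two cases you are in, the two earlier snippets can be combined into one cross-check (my own sketch; run_id obtained as before):

import mlflow

trainer_has_it = any("eval_loss" in entry for entry in trainer.state.log_history)

client = mlflow.tracking.MlflowClient()
run_id = mlflow.last_active_run().info.run_id
mlflow_has_it = bool(client.get_metric_history(run_id, "eval_loss"))

print(f"log_history has eval_loss: {trainer_has_it}, MLflow has eval_loss: {mlflow_has_it}")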
4. Is this intended behavior?
Based on the library design and current code:
- compute_metrics is optional.
- eval_loss should be computed, logged by the Trainer, and forwarded to MLflow regardless of whether compute_metrics is provided.
- The MLflow callback does not depend on compute_metrics.
So: the behavior you describe (“eval loss only appears in MLflow if I define compute_metrics”) is not the intended semantics.
It’s either:
- A misinterpretation of the MLflow UI (looking only at loss, not at eval_loss), or
- A genuine bug/edge case in the interaction between your versions of transformers, mlflow, and how the callbacks are wired.
5. Should you open a GitHub issue?
Yes, but only after a minimal set of checks so that the issue is actionable.
Before opening:
- Confirm that in the “no compute_metrics” run:
  - trainer.state.log_history contains at least one dict with "eval_loss" in it.
  - trainer.callback_handler.callbacks includes MLflowCallback.
  - MLflow’s Metrics tab truly has no eval_loss metric, not even as a separate entry.
- Note your versions (see the snippet after this list):
  - transformers.__version__
  - mlflow.__version__
  - Any wrapper trainers (TRL SFTTrainer, PEFT, etc.) if present.
- Prepare a minimal script:
  - Start from the official MLflow tutorial code.
  - Show version A: with compute_metrics → eval_loss appears in MLflow.
  - Show version B: same code but with compute_metrics=None → eval_loss missing, while the terminal and log_history still include it.
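For the version and callback checks, a short snippet like this (my own sketch, run in the same session as the trainer) collects the numbers to paste into the issue:

import transformers, mlflow
from transformers.integrations import MLflowCallback

print("transformers:", transformers.__version__)
print("mlflow:", mlflow.__version__)
print("MLflowCallback attached:",
      any(isinstance(cb, MLflowCallback) for cb in trainer.callback_handler.callbacks))
# plus the log_history / get_metric_history checks shown in sections 3.1 and 3.3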
If, after these checks, the difference is still 100% tied to compute_metrics, it is appropriate to open an issue on the Transformers repo (because the MLflow callback lives there). In the issue, explicitly state that:
- trainer.state.log_history contains eval_loss,
- but MLflowCallback does not log it unless compute_metrics is defined.
That will give the maintainers enough detail to reproduce and decide whether it’s a bug in the callback, in the Trainer, or in the MLflow tutorial wiring.