MLflowCallback TypeError: can only concatenate list (not "type") to list

I’m trying to capture autologged parameters with MLflow. I’m using the MLflowCallback class from transformers.integrations, which implements the transformers.TrainerCallback interface.

Here’s the relevant code. It tells MLflow to create an experiment and send it to a hosted tracking server, and tells transformers what type of logging to do (as defined by the MLflowCallback class).

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification, RobertaConfig, Trainer, TrainingArguments
import mlflow
from transformers import TrainerCallback
from transformers.integrations import MLflowCallback

remote_server_uri = [PRIVATE SERVER URL]
mlflow.set_tracking_uri(remote_server_uri)

# After loading and tokenizing data, here we run the training experiment
experiment_name = "ht_vp_roberta_randomSearch"
mlflow.set_experiment(experiment_name)             # server creates experiment folder at this point
with mlflow.start_run():
    training_args = TrainingArguments(
        output_dir=experiment_name,
        evaluation_strategy='epoch',
        eval_steps=500,
        gradient_accumulation_steps=1000,
        eval_accumulation_steps=1,
    )
    model = RobertaForSequenceClassification.from_pretrained("roberta-base")
    trainer = Trainer(
        args=training_args,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        model=model,
        compute_metrics=hf.compute_metrics,
        callbacks=MLflowCallback,                       # This triggers error
    )
    trainer.train()
    trainer.evaluate()

I’m getting the below error:

Traceback (most recent call last):
  File "mlflow_test_simple.py", line 80, in <module>
    trainer = Trainer(
  File "/home/jovyan/conda/dsEnv/lib/python3.8/site-packages/transformers/trainer.py", line 385, in __init__
    callbacks = default_callbacks if callbacks is None else default_callbacks + callbacks
TypeError: can only concatenate list (not "type") to list

To answer my initial question: I needed to put the MLflowCallback class inside a list when specifying the callbacks parameter:

trainer = Trainer(
    ...
    callbacks=[MLflowCallback],
)   
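
For anyone wondering why the list matters, here is a minimal sketch that reproduces the error without transformers installed. DemoCallback, default_callbacks and merge are hypothetical stand-ins; only the conditional expression is copied from the traceback above.

```python
class DemoCallback:
    """Hypothetical stand-in for a callback class such as MLflowCallback."""


# Stand-in for the list of callbacks the Trainer builds internally.
default_callbacks = []


def merge(callbacks):
    # Same expression as the trainer.py line shown in the traceback.
    return default_callbacks if callbacks is None else default_callbacks + callbacks


# Passing the bare class: list + type is not defined, so Python raises TypeError.
error_message = None
try:
    merge(DemoCallback)
except TypeError as exc:
    error_message = str(exc)

# Passing a list that contains the class: list + list works.
merged = merge([DemoCallback])

print(error_message)
print(merged)
```

The same applies when passing a single callback of any kind: the callbacks argument has to be a list, even if it holds one item.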

I also noticed that the MLflowCallback class calls mlflow.start_run() itself, so there is no need to call it in my script. The new code looks like this:

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification, RobertaConfig, Trainer, TrainingArguments
import mlflow
from transformers import TrainerCallback
from transformers.integrations import MLflowCallback

remote_server_uri = [PRIVATE SERVER URL]
mlflow.set_tracking_uri(remote_server_uri)

# After loading and tokenizing data, here we run the training experiment
experiment_name = "ht_vp_roberta_randomSearch"
mlflow.set_experiment(experiment_name)             
training_args = TrainingArguments(
    output_dir=experiment_name,
    evaluation_strategy='epoch',
    eval_steps=500,
    gradient_accumulation_steps=1000,
    eval_accumulation_steps=1,
)
model = RobertaForSequenceClassification.from_pretrained("roberta-base")
trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    model=model,
    compute_metrics=compute_metrics,
    callbacks=[MLflowCallback],
)
trainer.train()
trainer.evaluate()

However, now I’m getting a new error about nested runs: MLflow wants me to set nested=True in the start_run call that is embedded in the MLflowCallback class.

Traceback (most recent call last):
  File "hypertune_mlflow_test_simple.py", line 89, in <module>
    trainer.train()
  File "/home/jovyan/conda/dsEnv/lib/python3.8/site-packages/transformers/trainer.py", line 1224, in train
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/home/jovyan/conda/dsEnv/lib/python3.8/site-packages/transformers/trainer_callback.py", line 340, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/home/jovyan/conda/dsEnv/lib/python3.8/site-packages/transformers/trainer_callback.py", line 378, in call_event
    result = getattr(callback, event)(
  File "/home/jovyan/conda/dsEnv/lib/python3.8/site-packages/transformers/integrations.py", line 665, in on_train_begin
    self.setup(args, state, model)
  File "/home/jovyan/conda/dsEnv/lib/python3.8/site-packages/transformers/integrations.py", line 641, in setup
    self._ml_flow.start_run()
  File "/home/jovyan/conda/dsEnv/lib/python3.8/site-packages/mlflow/tracking/fluent.py", line 109, in start_run
    raise Exception(("Run with UUID {} is already active. To start a nested " +
Exception: Run with UUID 8a5c8b90de7c412fb6c857b54416f346 is already active. To start a nested run, call start_run with nested=True

I just ran into the same problem.

However, now I’m getting a new error, about nested processes, and mlflow wanting me to set nested=True in the start_run call that is embedded in the MLflowCallback class

I found the cause of this error: Trainer’s __init__ automatically adds integration callbacks based on the installed packages.
As a result, the MLflow callback passed as an argument to Trainer was registered a second time, and mlflow.start_run() was executed twice.

The solution is to not pass the MLflow callback as an argument:

trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    model=model,
    compute_metrics=compute_metrics,
)

As a side note, I found the cause in lines 391-395 of trainer.py.
get_reporting_integration_callbacks(self.args.report_to) returns the callbacks for all installed integrations, which are then added to the callback handler.

default_callbacks = DEFAULT_CALLBACKS + get_reporting_integration_callbacks(self.args.report_to)
callbacks = default_callbacks if callbacks is None else default_callbacks + callbacks
self.callback_handler = CallbackHandler(
    callbacks, self.model, self.tokenizer, self.optimizer, self.lr_scheduler
)
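
To illustrate the effect, here is a toy model of that merge, not the actual transformers or mlflow code. ToyTracker, ToyLoggingCallback and build_callbacks are hypothetical names; the only thing taken from trainer.py is the shape of the default_callbacks + callbacks expression.

```python
class ToyTracker:
    """Hypothetical tracker that, like the error above suggests, allows one active run."""

    def __init__(self):
        self.active = False

    def start_run(self):
        if self.active:
            raise RuntimeError("a run is already active")
        self.active = True


class ToyLoggingCallback:
    """Hypothetical callback that starts a run when training begins."""

    def __init__(self, tracker):
        self.tracker = tracker

    def on_train_begin(self):
        self.tracker.start_run()


def build_callbacks(defaults, user_callbacks=None):
    # Same shape as the trainer.py expression quoted above.
    return defaults if user_callbacks is None else defaults + user_callbacks


tracker = ToyTracker()
defaults = [ToyLoggingCallback(tracker)]  # added automatically by the integration lookup
duplicated = build_callbacks(defaults, [ToyLoggingCallback(tracker)])  # user passes it again

errors = []
for callback in duplicated:
    try:
        callback.on_train_begin()
    except RuntimeError as exc:
        errors.append(str(exc))

print(len(duplicated), errors)
```

In this sketch the first callback starts the run and the second one fails, which matches the pattern in the traceback: the error only appears once training begins, not when the Trainer is constructed.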

You may have already solved it, but I’ll leave the solution here for anyone who runs into this problem later.


Some excellent sleuthing here! This confirms some behavior I’ve seen even when I don’t import mlflow but mlflow is installed in the same environment. Interesting!!!
