Using hyperparameter-search in Trainer

Note that you can use pretty much anything in optuna and Ray Tune by just subclassing the Trainer and overriding the proper methods.
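
A minimal setup sketch for context (the model name and datasets are placeholders): hyperparameter_search needs the Trainer to be created with a model_init so that each trial starts from a freshly initialized model.

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # a fresh model is instantiated for every trial
    return AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

trainer = Trainer(
    args=TrainingArguments(output_dir="hp_search", num_train_epochs=5),
    model_init=model_init,
    train_dataset=train_dataset,  # assumed to be defined elsewhere
    eval_dataset=eval_dataset,
)

best_run = trainer.hyperparameter_search(n_trials=10, direction="minimize")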

I’m having some issues with this under the optuna backend. Here is my hyperparameter space:

def hyperparameter_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True)
    }
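
A space like this is then passed in through the hp_space argument, along the lines of (n_trials here is arbitrary):

best_run = trainer.hyperparameter_search(
    hp_space=hyperparameter_space,
    backend="optuna",
    n_trials=10,
    direction="minimize",
)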

When I call trainer.hyperparameter_search on this, I find that it varies the number of epochs too, despite those being fixed to 5 in TrainingArguments. The current run has completed several 5-epoch trials, but it is now running a 20-epoch trial… Has anyone observed anything like this?

Thank you very much.

That may be linked to a bug I fixed a few weeks ago where the Trainer modified its TrainingArguments: it used to change the value of max_steps, which would then change the number of epochs for you, since you are varying the batch size.

Can you check if you get this behavior on current master?

Hi @sgugger, in case you’re not aware of it, it seems the latest commit on master broke the Colab notebook you shared on Twitter.

In that notebook, I hit the following error when running

best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

with the optuna backend.

Stack trace:

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:900: RuntimeWarning: invalid value encountered in double_scalars

[W 2020-10-22 14:58:41,815] Trial 0 failed because of the following error: RuntimeError("Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/optuna/study.py", line 799, in _run_trial
    result = func(trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 114, in _objective
    trainer.train(model_path=model_path, trial=trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 803, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 855, in _maybe_log_save_evaluate
    self._report_to_hp_search(trial, epoch, metrics)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 537, in _report_to_hp_search
    self.objective = self.compute_objective(metrics.copy())
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py", line 120, in default_compute_objective
    "Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-12c3f54763db> in <module>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py in default_compute_objective(metrics)
    118     if len(metrics) != 0:
    119         raise RuntimeError(
--> 120             "Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
    121         )
    122     return loss

RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.

I tried passing the dict produced by trainer.evaluate() as the compute_objective argument, but that fails with TypeError: 'dict' object is not callable.
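
That TypeError is because compute_objective expects a callable that maps the metrics dict to a single float, not the dict itself. A minimal sketch (the "eval_accuracy" key is a hypothetical example of what your compute_metrics might produce):

def my_objective(metrics):
    # metrics is the dict of evaluation results for the current trial
    return metrics["eval_accuracy"]  # hypothetical key

best_run = trainer.hyperparameter_search(
    n_trials=10, direction="maximize", compute_objective=my_objective
)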

I’d be happy to fix the docs / code if you can give me some pointers on where to start!

I did realize, and this has been fixed in this commit. Thanks for warning me :slight_smile:

@sgugger Early tests on the master branch seem to indicate the change in epochs is gone (unless I’m getting very unlucky with random numbers)… Actually, I think I’m no longer seeing this on the latest pip release either. Thanks! Doing great things.

Hi @sgugger! Do you have any suggestions on how to use the hyperparameter search from Trainer with optuna as the backend and integrate it with wandb?

When I try to do it, it complains that wandb can only watch one model at a time :(

It throws the following error:
You can only call wandb.watch once per model. Pass a new instance of the model if you need to call wandb.watch again in your code.

I have no idea where the problem lies. I’ll look at it when I have some time, but we usually let the maintainers of third-party libraries like optuna and wandb fix the integrations themselves, as they know their tools better :slight_smile:
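
One thing that may be worth trying in the meantime (an assumption, not verified in this thread): the transformers wandb integration reads a WANDB_WATCH environment variable, and setting it to "false" should skip the wandb.watch call that can only happen once per model.

import os

# assumption: the transformers wandb integration honors WANDB_WATCH;
# set it before the Trainer is created so wandb.watch is never called
os.environ["WANDB_WATCH"] = "false"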

Hi @sgugger! I’m trying to train my model using PopulationBasedTraining from ray. This is how I’m doing the search:

from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import uniform
from random import randint

scheduler = PopulationBasedTraining(
    mode="max",
    metric="mean_accuracy",
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": lambda: uniform(0.0, 0.3),
        "learning_rate": lambda: uniform(1e-5, 5e-5),
        "per_gpu_train_batch_size": [16, 32, 64],
        "num_train_epochs": [2, 3, 4],
        "warmup_steps": lambda: randint(0, 500),
    },
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=scheduler,
)

I’m getting this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
    .format(err_tb_str)))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 575, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 180, in _objective
    trainer.train(model_path=model_path, trial=trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 577, in train
    self._hp_search_setup(trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 519, in _hp_search_setup
    value = type(old_attr)(value)
TypeError: float() argument must be a string or a number, not 'Float'

I have tried using hp_space instead of defining the parameters inside the scheduler, but then the parameters don’t appear in training, and I get a similar error with int instead of float.

I have no idea why you get this Float type that is not castable to float. If it is the return type of ray.tune.uniform, I think you might have to add something to convert it to a regular Python float in your lambda functions.
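
One way to read that suggestion (a sketch, untested here) is to sample with the plain-Python random module, so the lambdas return regular floats and ints rather than ray.tune domain objects:

from random import randint, uniform

hyperparam_mutations = {
    # these lambdas return plain Python floats/ints
    "weight_decay": lambda: uniform(0.0, 0.3),
    "learning_rate": lambda: uniform(1e-5, 5e-5),
    "per_device_train_batch_size": [16, 32, 64],
    "num_train_epochs": [2, 3, 4],
    "warmup_steps": lambda: randint(0, 500),
}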

I don’t know what exactly happened, but I changed the parameters to:

from ray import tune

hyperparam_mutations = {
    "weight_decay": tune.uniform(0.0, 0.3),
    "learning_rate": tune.uniform(1e-5, 5e-5),
    "per_device_train_batch_size": tune.choice([16, 32, 64]),
    "num_train_epochs": tune.choice([2, 3, 4]),
    "warmup_steps": tune.choice(range(0, 500)),
}

and it seems to work.

But now I have another problem. I created a custom function that returns accuracy, which is passed to the Trainer. I want to use that accuracy as the metric in ray. I saw the example compute_objective function that you posted, but I don’t know what metrics is or how to use my accuracy there.

Hi @tr3cks

here metrics is a dict which contains the metrics you defined: loss, accuracy, etc.

So to use accuracy as the metric/objective for the hparam search, you should return the accuracy value from your compute_objective function.

If your key is accuracy, then you could return metrics["accuracy"] from the compute_objective function.
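
Putting it together, a minimal sketch (assuming your compute_metrics returns an accuracy entry; depending on the Trainer version the key may be prefixed as "eval_accuracy"):

def compute_objective(metrics):
    return metrics["accuracy"]  # or metrics["eval_accuracy"], depending on the prefix

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    compute_objective=compute_objective,
    scheduler=scheduler,
)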