Using hyperparameter_search in Trainer

Note that you can use pretty much anything in optuna and Ray Tune by just subclassing the Trainer and overriding the proper methods.
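
For reference, the basic flow looks roughly like this (a sketch rather than a full recipe: train_dataset and eval_dataset are assumed to be tokenized datasets you already have, and you need a model_init so each trial starts from a fresh model):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # A fresh model is instantiated at the start of every trial.
    return AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    args=TrainingArguments(output_dir="hp_search", evaluation_strategy="epoch"),
    model_init=model_init,
    train_dataset=train_dataset,  # assumed to exist already
    eval_dataset=eval_dataset,    # assumed to exist already
)

best_run = trainer.hyperparameter_search(n_trials=10, direction="minimize")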


I'm having some issues with this under the optuna backend. Here is my hyperparameter space:

def hyperparameter_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    }
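
which I then pass in roughly like this (direction and n_trials here are just placeholder values):

best_run = trainer.hyperparameter_search(
    hp_space=hyperparameter_space,
    backend="optuna",
    direction="minimize",
    n_trials=20,
)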

When I call trainer.hyperparameter_search on this, I find that it varies the number of epochs too, despite num_train_epochs being fixed at 5 in my TrainingArguments. The run that's going now has run 5-epoch trials a few times, but now it's running a 20-epoch trial… Has anyone observed anything like this?

Thank you very much.

That may be linked to a bug I fixed a few weeks ago where the Trainer modified its TrainingArguments: it used to change the value of max_steps, which would then change the number of epochs for you, since you are changing the batch size.

Can you check if you get this behavior on current master?

Hi @sgugger, in case you're not aware of it, it seems the latest commit on master broke the Colab notebook you shared on Twitter.

Trying to run that notebook, I hit the following error when executing

best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

with the optuna backend.

Stack trace:

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:900: RuntimeWarning: invalid value encountered in double_scalars

[W 2020-10-22 14:58:41,815] Trial 0 failed because of the following error: RuntimeError("Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/optuna/study.py", line 799, in _run_trial
    result = func(trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 114, in _objective
    trainer.train(model_path=model_path, trial=trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 803, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 855, in _maybe_log_save_evaluate
    self._report_to_hp_search(trial, epoch, metrics)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 537, in _report_to_hp_search
    self.objective = self.compute_objective(metrics.copy())
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py", line 120, in default_compute_objective
    "Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-12c3f54763db> in <module>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py in default_compute_objective(metrics)
    118     if len(metrics) != 0:
    119         raise RuntimeError(
--> 120             "Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
    121         )
    122     return loss

RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.

I tried passing the dict produced by trainer.evaluate() as the compute_objective arg, but that complains with TypeError: 'dict' object is not callable (presumably compute_objective needs to be a function of the metrics, not the metrics dict itself).

I'd be happy to fix the docs/code if you can give me some pointers on where to start!

I did realize, and this has been fixed in this commit. Thanks for warning me 🙂


@sgugger Early tests from the master branch seem to indicate the change in epochs is gone (unless I'm getting very unlucky with random numbers)… Actually, I'm no longer seeing this on the latest pip release either, I think. Thanks! Doing great things.

Hi @sgugger! Do you have any suggestions on how to use hyperparameter_search from Trainer with optuna as the backend and integrate it with wandb?

When I try to do it, it throws the following error:

You can only call wandb.watch once per model. Pass a new instance of the model if you need to call wandb.watch again in your code.

I have no idea where the problem lies. I'll look at it when I have some time, but we usually let the maintainers of third-party libraries like optuna and wandb fix the integrations themselves, as they know their tools better 🙂
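
That said, one workaround that may be worth trying in the meantime (an assumption on my side: the transformers wandb integration checks the WANDB_WATCH environment variable before calling wandb.watch):

import os

# Skip wandb.watch entirely so repeated trials don't try to
# re-register the model with wandb.
os.environ["WANDB_WATCH"] = "false"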


Hi @sgugger! I'm trying to train my model using PopulationBasedTraining from ray. This is how I'm doing the search:

from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import uniform
from random import randint

scheduler = PopulationBasedTraining(
    mode = "max",
    metric='mean_accuracy',
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": lambda: uniform(0.0, 0.3),
        "learning_rate": lambda: uniform(1e-5, 5e-5),
        "per_gpu_train_batch_size": [16, 32, 64],
        "num_train_epochs": [2,3,4],
        "warmup_steps":lambda: randint(0, 500)
    }
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=scheduler)

I'm getting this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
    .format(err_tb_str)))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 575, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 180, in _objective
    trainer.train(model_path=model_path, trial=trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 577, in train
    self._hp_search_setup(trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 519, in _hp_search_setup
    value = type(old_attr)(value)
TypeError: float() argument must be a string or a number, not 'Float'

I have tried using hp_space instead of defining the parameters inside the scheduler, but the parameters don't show up in the training, and I get a similar error with int instead of float.


I have no idea why you get this Float type that is not castable to float. If this is the return type of ray.tune.uniform, I think you might have to add something in your lambda functions to convert it to a regular Python float.

I don't know what really happened, but I changed the parameters to:

hyperparam_mutations={
    "weight_decay": tune.uniform(0.0, 0.3),
    "learning_rate": tune.uniform(1e-5, 5e-5),
    "per_device_train_batch_size": tune.choice([16, 32, 64]),
    "num_train_epochs": tune.choice([2, 3, 4]),
    "warmup_steps": tune.choice(range(0, 500)),
}

and it seems to work.
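
For reference, the same ranges should also be expressible through hp_space, so that the Trainer samples them itself (a sketch, untested on my side with this exact setup; the keys must match TrainingArguments attribute names):

from ray import tune

def ray_hp_space(trial):
    # Keys must be names of TrainingArguments attributes.
    return {
        "weight_decay": tune.uniform(0.0, 0.3),
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "per_device_train_batch_size": tune.choice([16, 32, 64]),
        "num_train_epochs": tune.choice([2, 3, 4]),
        "warmup_steps": tune.randint(0, 500),
    }

best_trial = trainer.hyperparameter_search(
    hp_space=ray_hp_space,
    direction="maximize",
    backend="ray",
    n_trials=4,
    scheduler=scheduler,
)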

But now I have another problem. I created a custom function to compute accuracy, which is passed to the trainer. I want to use that accuracy as the metric in ray. I saw the example compute_objective function that you posted, but I don't know what metrics is or how to use the accuracy.


Hi @tr3cks

here metrics is a dict which contains the metrics you defined: loss, accuracy, etc.

So to use accuracy as the metric/objective for the hparam search, you should return the accuracy value from the compute_objective function.

If your key is accuracy, then you could return metrics["accuracy"] from the compute_objective function.
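
For example, a minimal version could look like this (a sketch: "eval_accuracy" assumes your compute_metrics returns a key named accuracy, since eval metrics are reported with an eval_ prefix):

def my_compute_objective(metrics):
    # metrics is the evaluation dict logged during training;
    # adjust the key to whatever your compute_metrics returns.
    return metrics["eval_accuracy"]

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    scheduler=scheduler,
    compute_objective=my_compute_objective,
)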


Hi @sgugger, I am using Ray Tune with Hugging Face for hyperparameter tuning; here is my code snippet:

from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import uniform
from random import randint
scheduler = PopulationBasedTraining(
    mode = "max",
    metric='mean_accuracy',
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": lambda: uniform(0.0, 0.3),
        "learning_rate": lambda: uniform(1e-5, 5e-5),
        "per_gpu_train_batch_size": [16, 32, 64],
        "num_train_epochs": [2,3,4],
        "warmup_steps":lambda: randint(0, 500)
    }
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=scheduler)

However, this code results in the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=800, ip=172.28.0.2)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=800, ip=172.28.0.2)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 576, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 651, in _inner
    inner(config, checkpoint_dir=None)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 644, in inner
    fn_kwargs[k] = parameter_registry.get(prefix + k)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/registry.py", line 167, in get
    return ray.get(self.references[k])
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 245, in deserialize_objects
    self._deserialize_object(data, metadata, object_ref))
  File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 192, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 170, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 160, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'datasets_modules'

I would really appreciate any help identifying the cause of this problem, thanks! 🙂

Note: I have the dataset correctly imported and everything works as expected; only this snippet results in the error mentioned.

Good morning @richardliaw and @sgugger.

First of all, many thanks for your work on trainer.hyperparameter_search() that will help a lot of people!

However, I think that your notebooks (Hyperparameter Search with Transformers and Ray Tune & text_classification.ipynb) need updates. Indeed, I tried to run them with Transformers + Ray [tune], but they failed with the same problem mentioned by @sonam (ModuleNotFoundError: No module named 'datasets_modules').

To demonstrate the problem, for example with the code from Hyperparameter Search with Transformers and Ray Tune, I published a test notebook on Colab.

Could you tell us how to make them run?
Many thanks in advance.


Hi! Thank you for the detailed explanation. One thing I am not sure about is how to get my_objective to work. How would I define it if I want my objective to be F1 or accuracy?
I have tried
I have tried

def my_objective(metrics):
    return metrics['eval_macro_f1']

But it doesn't work. What is metrics?

1 Like

hey @theudster, I suggest looking at the optuna docs to get an idea of how objective functions are defined. Here's a PyTorch example that optimises for accuracy: optuna/pytorch_simple.py at master · optuna/optuna · GitHub

You should be able to adapt this to compute the F1 score (or whatever metric you want to optimise for) 🙂


Hello, I was not able to use optuna's "gc_after_trial" option in hyperparameter_search. Without it I always get CUDA out of memory. Is there a way?

I found out that when defining my metric, I need to add eval_ before it. So the metric I called 'f1' had to be called 'eval_f1'.
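
So the naming lines up roughly like this (a sketch assuming a scikit-learn based compute_metrics; swap in your own metric):

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)
    # Returned under the key "macro_f1"...
    return {"macro_f1": f1_score(labels, preds, average="macro")}

def my_objective(metrics):
    # ...but reported to the hp search as "eval_macro_f1".
    return metrics["eval_macro_f1"]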


Hi @sonam, have you managed to get it to work? I keep getting the same error, so any help would be much appreciated.

Thanks!

How do you pass parameters to the model_init function? I have two parameters that need to be set in this function, so I have defined it as so:

def get_model(params):
    db_config = db_config_base
    db_config.update({'alpha': params['alpha_val'], 'dropout': params['dropout_val']})
    return DistilBERTForMultipleSequenceClassification.from_pretrained(db_config, num_labels1=2, num_labels2=8)

I thought maybe ray-tune would pass the parameters of the specific instance to this function, but instead I get this error:

TypeError: 'NoneType' object is not subscriptable

AKA 'params' is None
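
For anyone hitting this: one likely cause is that the Trainer also calls model_init once when it is constructed, before any trial has started, so params is None at that point. A guard along these lines (a sketch reusing the hypothetical names from the snippet above, with assumed default values) avoids the crash:

def get_model(params):
    # Trainer calls model_init(None) once at construction time,
    # before any hp-search trial runs, so params may be None here.
    if params is None:
        params = {"alpha_val": 0.5, "dropout_val": 0.1}  # assumed defaults
    db_config = db_config_base  # base config from the snippet above
    db_config.update({"alpha": params["alpha_val"], "dropout": params["dropout_val"]})
    return DistilBERTForMultipleSequenceClassification.from_pretrained(db_config, num_labels1=2, num_labels2=8)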
