Note that you can use pretty much anything in optuna
and Ray Tune
by just subclassing the Trainer
and overriding the proper methods.
Iām having some issues with this, under the optuna
backend. Here is my hyperparameter space :
def hyperparameter_space(trial):
return {
"learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
"weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
"adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True)
}
When I call trainer.hyperparameter_search
on this, I find that it varies the number of epochs, too, despite these being fixed in TrainingArguments
to 5. The run thatās going now has run 5-epoch trials a few times but now itās running a 20-epoch trial⦠Has anyone observed anything like this ?
Thank you very much.
That may be linked to some bug I fixed a few weeks ago with the Trainer
modifying its TrainingArguments
: it used to change the value of max_steps
which would then change the number of epochs for you since you are changing the batch size.
Can you check if you get this behavior on current master?
Hi @sgugger, in case youāre not aware of it, it seems the latest commit on master broke the Colab notebook you shared on Twitter
Trying to run that notebook, I hit the following error when trying to run
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
with the optuna backend.
Stack trace:
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:900: RuntimeWarning:
invalid value encountered in double_scalars
[W 2020-10-22 14:58:41,815] Trial 0 failed because of the following error: RuntimeError("Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.",)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/optuna/study.py", line 799, in _run_trial
result = func(trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 114, in _objective
trainer.train(model_path=model_path, trial=trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 803, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 855, in _maybe_log_save_evaluate
self._report_to_hp_search(trial, epoch, metrics)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 537, in _report_to_hp_search
self.objective = self.compute_objective(metrics.copy())
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py", line 120, in default_compute_objective
"Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-26-12c3f54763db> in <module>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
10 frames
/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py in default_compute_objective(metrics)
118 if len(metrics) != 0:
119 raise RuntimeError(
--> 120 "Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
121 )
122 return loss
RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.
I tried passing the dict
produced by trainer.evaluate()
in the compute_objective
arg, but this complains that TypeError: 'dict' object is not callable
.
Iād be happy to fix the docs / code if you can give me some pointers on where to start!
@sgugger Early tests from the master branch seem to indicate the change in epochs is gone ( unless Iām getting very unlucky with random numbers )⦠Actually, Iām no longer seeing this on the latest pip release either, I think. Thanks ! Doing great things.
Hi @sgugger! Do you have any suggestion about how I should be able to use the hyperparameter-search from Trainer with optuna
as backend and integrate it with wandb
?
When I try to do it it returns that I can only use only one wandb
per model ):
It throws the following error:
You can only call wandb.watch once per model. Pass a new instance of the model if you need to call wandb.watch again in your code.
I have no idea of where the problem lies. Iāll look at it when I have some time, but we usually let the maintainers of the third-party libraries like optuna and wandb fix the integrations themselves as they know their tools better
Hi @sgugger! Iām trying to train my model using PopulationBasedTraining
from ray
. This is how Iām doing the search:
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import uniform
from random import randint
scheduler = PopulationBasedTraining(
mode = "max",
metric='mean_accuracy',
perturbation_interval=2,
hyperparam_mutations={
"weight_decay": lambda: uniform(0.0, 0.3),
"learning_rate": lambda: uniform(1e-5, 5e-5),
"per_gpu_train_batch_size": [16, 32, 64],
"num_train_epochs": [2,3,4],
"warmup_steps":lambda: randint(0, 500)
}
)
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="ray",
n_trials=4,
keep_checkpoints_num=1,
scheduler=scheduler)
Iām getting this error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 726, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1452, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
result = self.step()
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 366, in step
self._report_thread_runner_error(block=True)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
.format(err_tb_str)))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 248, in run
self._entrypoint()
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
self._status_reporter.get_checkpoint())
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 575, in _trainable_func
output = fn()
File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 180, in _objective
trainer.train(model_path=model_path, trial=trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 577, in train
self._hp_search_setup(trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 519, in _hp_search_setup
value = type(old_attr)(value)
TypeError: float() argument must be a string or a number, not 'Float'
I have tried to use hp_space
instead of defining the parameters inside of the scheduler, but the parameters donāt appear on the training and I get a similar error with Ƭnt
instead of float
.
I have no idea of why you get this type Float
that is not castable to float
. If this is the return type of ray.tune.uniform
, I think you might have to add something to convert it to a regular Python float
in your lambda
functions.
I donāt know what really happened but I changed parameters to:
hyperparam_mutations={
"weight_decay": tune.uniform(0.0, 0.3),
"learning_rate": tune.uniform(1e-5, 5e-5),
"per_device_train_batch_size": tune.choice([16, 32, 64]),
"num_train_epochs": tune.choice([2,3,4]),
"warmup_steps":tune.choice(range(0, 500))
}
and it seems to work.
But now I have another problem. I created a custom function to return accuracy, which is passed to the trainer. I want to use that accuracy as metric in ray
. I saw the example compute_objective
function that you posted, but I donāt know what is metrics
and how to use accuracy.
Hi @tr3cks
here metrics
is dict which contains the metrics you defined, loss, accuracy etc.
So to use accuracy as metric/objective for hparam search, you should return accuracy value from the compute_objective
function.
If your key is accuracy
then you could return metrics["accuracy"]
from the compute_objective
function.