Note that you can use pretty much anything in optuna and Ray Tune by just subclassing the Trainer and overriding the proper methods.
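For example, here is a rough sketch (my own illustration, not from the docs) of a subclass that overrides evaluate so that the metrics reported to the optuna / Ray Tune backend include an extra value; you would pair this with your own compute_objective:

from transformers import Trainer

class SearchFriendlyTrainer(Trainer):
    def evaluate(self, *args, **kwargs):
        # Run the usual evaluation loop first.
        metrics = super().evaluate(*args, **kwargs)
        # Add an extra entry for the search backend to use; the key name
        # "eval_objective" is just an illustration, and with extra keys you
        # need to pass your own compute_objective to hyperparameter_search.
        metrics["eval_objective"] = metrics.get("eval_accuracy", 0.0)
        return metrics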
I'm having some issues with this under the optuna backend. Here is my hyperparameter space:
def hyperparameter_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    }
When I call trainer.hyperparameter_search on this, I find that it also varies the number of epochs, despite the number of epochs being fixed to 5 in my TrainingArguments. The run that's going now has completed several 5-epoch trials, but it is currently running a 20-epoch trial… Has anyone observed anything like this?
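(For reference, a space like this is hooked up to the search via the hp_space argument, roughly as in the sketch below; the n_trials and direction values here are just placeholders.)

best_run = trainer.hyperparameter_search(
    hp_space=hyperparameter_space,
    backend="optuna",
    direction="minimize",
    n_trials=20,
)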
Thank you very much.
That may be linked to a bug I fixed a few weeks ago where the Trainer modified its TrainingArguments: it used to change the value of max_steps, which would then change the number of epochs for you since you are varying the batch size. Can you check whether you still get this behavior on current master?
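To make the arithmetic of that bug concrete (hypothetical dataset size, purely illustrative): if max_steps was computed once for one batch size and then kept, a trial with a larger batch size has fewer steps per epoch and therefore runs more epochs.

# Hypothetical illustration of the old max_steps bug, not actual Trainer code.
train_examples = 8_000
num_train_epochs = 5

# Suppose max_steps was (incorrectly) frozen from a batch-size-8 configuration:
frozen_max_steps = num_train_epochs * (train_examples // 8)   # 5_000 steps

for batch_size in (8, 16, 32):
    steps_per_epoch = train_examples // batch_size
    print(batch_size, frozen_max_steps / steps_per_epoch)   # 5.0, 10.0, 20.0 epochs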
Hi @sgugger, in case you're not aware of it, it seems the latest commit on master broke the Colab notebook you shared on Twitter.
Trying to run that notebook, I hit the following error when running
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
with the optuna backend.
Stack trace:
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:900: RuntimeWarning:
invalid value encountered in double_scalars
[W 2020-10-22 14:58:41,815] Trial 0 failed because of the following error: RuntimeError("Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.",)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/optuna/study.py", line 799, in _run_trial
result = func(trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 114, in _objective
trainer.train(model_path=model_path, trial=trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 803, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 855, in _maybe_log_save_evaluate
self._report_to_hp_search(trial, epoch, metrics)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 537, in _report_to_hp_search
self.objective = self.compute_objective(metrics.copy())
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py", line 120, in default_compute_objective
"Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-26-12c3f54763db> in <module>()
----> 1 best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
10 frames
/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py in default_compute_objective(metrics)
118 if len(metrics) != 0:
119 raise RuntimeError(
--> 120 "Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function."
121 )
122 return loss
RuntimeError: Metrics contains more entries than just 'eval_loss', 'epoch' and 'total_flos', please provide your own compute_objective function.
I tried passing the dict produced by trainer.evaluate() as the compute_objective argument, but that fails with TypeError: 'dict' object is not callable.
I'd be happy to fix the docs / code if you can give me some pointers on where to start!
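(Side note from re-reading the error: compute_objective has to be a callable that receives the metrics dict, not the dict itself. A minimal sketch, assuming the relevant key is "eval_accuracy":)

best_run = trainer.hyperparameter_search(
    n_trials=10,
    direction="maximize",
    compute_objective=lambda metrics: metrics["eval_accuracy"],
)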
I did realize it, and this has been fixed in this commit. Thanks for warning me!
@sgugger Early tests from the master branch seem to indicate the change in epochs is gone (unless I'm getting very unlucky with random numbers)… Actually, I think I'm no longer seeing this on the latest pip release either. Thanks! Doing great things.
Hi @sgugger! Do you have any suggestion on how to use the hyperparameter search from the Trainer with optuna as the backend and integrate it with wandb? When I try to do it, it complains that wandb can only be used once per model :(
It throws the following error:
You can only call wandb.watch once per model. Pass a new instance of the model if you need to call wandb.watch again in your code.
I have no idea where the problem lies. I'll look at it when I have some time, but we usually let the maintainers of third-party libraries like optuna and wandb fix the integrations themselves, as they know their tools better.
Hi @sgugger! I'm trying to train my model using PopulationBasedTraining from ray. This is how I'm doing the search:
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import uniform
from random import randint

scheduler = PopulationBasedTraining(
    mode="max",
    metric="mean_accuracy",
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": lambda: uniform(0.0, 0.3),
        "learning_rate": lambda: uniform(1e-5, 5e-5),
        "per_gpu_train_batch_size": [16, 32, 64],
        "num_train_epochs": [2, 3, 4],
        "warmup_steps": lambda: randint(0, 500),
    },
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=scheduler,
)
I'm getting this error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 726, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1452, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
result = self.step()
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 366, in step
self._report_thread_runner_error(block=True)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
.format(err_tb_str)))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=1971, ip=172.28.0.2)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 248, in run
self._entrypoint()
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
self._status_reporter.get_checkpoint())
File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 575, in _trainable_func
output = fn()
File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 180, in _objective
trainer.train(model_path=model_path, trial=trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 577, in train
self._hp_search_setup(trial)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 519, in _hp_search_setup
value = type(old_attr)(value)
TypeError: float() argument must be a string or a number, not 'Float'
I have tried to use hp_space instead of defining the parameters inside the scheduler, but then the parameters don't show up in training and I get a similar error, with int instead of float.
I have no idea why you get this Float type that is not castable to float. If that is the return type of ray.tune.uniform, I think you might have to add something to convert it to a regular Python float in your lambda functions.
I don't know what really happened, but I changed the parameters to:
hyperparam_mutations={
    "weight_decay": tune.uniform(0.0, 0.3),
    "learning_rate": tune.uniform(1e-5, 5e-5),
    "per_device_train_batch_size": tune.choice([16, 32, 64]),
    "num_train_epochs": tune.choice([2, 3, 4]),
    "warmup_steps": tune.choice(range(0, 500)),
}
and it seems to work.
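For anyone skimming this later, the full setup that seems to work, stitched together from the snippets above (a sketch; it assumes from ray import tune and the trainer defined earlier):

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

scheduler = PopulationBasedTraining(
    mode="max",
    metric="mean_accuracy",
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": tune.uniform(0.0, 0.3),
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "per_device_train_batch_size": tune.choice([16, 32, 64]),
        "num_train_epochs": tune.choice([2, 3, 4]),
        "warmup_steps": tune.choice(range(0, 500)),
    },
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=scheduler,
)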
But now I have another problem. I created a custom function that returns accuracy, which is passed to the trainer, and I want to use that accuracy as the metric in ray. I saw the example compute_objective function that you posted, but I don't know what metrics is or how to use accuracy with it.
Hi @tr3cks, here metrics is a dict which contains the metrics you defined: loss, accuracy, etc.
So to use accuracy as the metric/objective for the hparam search, you should return the accuracy value from the compute_objective function. If your key is accuracy, then you could return metrics["accuracy"] from the compute_objective function.
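A small sketch of that wiring, with illustrative names (note that depending on the Trainer version the evaluation keys may be prefixed, e.g. eval_accuracy instead of accuracy, so check what your metrics dict actually contains):

import numpy as np

def compute_metrics(eval_pred):
    # Passed to Trainer(compute_metrics=...) so that "accuracy" ends up in the metrics dict.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

def compute_objective(metrics):
    # metrics is the evaluation dict; return the value the search should maximize.
    return metrics.get("eval_accuracy", metrics.get("accuracy"))

# then: trainer.hyperparameter_search(direction="maximize", backend="ray",
#                                     compute_objective=compute_objective, ...)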
Hi @sgugger, I am using Ray Tune with the Hugging Face Trainer for hyperparameter tuning; here is my code snippet:
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import uniform
from random import randint

scheduler = PopulationBasedTraining(
    mode="max",
    metric="mean_accuracy",
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": lambda: uniform(0.0, 0.3),
        "learning_rate": lambda: uniform(1e-5, 5e-5),
        "per_gpu_train_batch_size": [16, 32, 64],
        "num_train_epochs": [2, 3, 4],
        "warmup_steps": lambda: randint(0, 500),
    },
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=scheduler,
)
However, this code results in the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 586, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=800, ip=172.28.0.2)
File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 167, in train_buffered
result = self.train()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 226, in train
result = self.step()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 366, in step
self._report_thread_runner_error(block=True)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=800, ip=172.28.0.2)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 248, in run
self._entrypoint()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
self._status_reporter.get_checkpoint())
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 576, in _trainable_func
output = fn()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 651, in _inner
inner(config, checkpoint_dir=None)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 644, in inner
fn_kwargs[k] = parameter_registry.get(prefix + k)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/registry.py", line 167, in get
return ray.get(self.references[k])
File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 245, in deserialize_objects
self._deserialize_object(data, metadata, object_ref))
File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 192, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata_fields)
File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 170, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/usr/local/lib/python3.7/dist-packages/ray/serialization.py", line 160, in _deserialize_pickle5_data
obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'datasets_modules'
I would really appreciate any help identifying the cause of this problem, thanks!
Note: the dataset is imported correctly and everything works as expected; only this snippet results in the error above.
Good morning @richardliaw and @sgugger.
First of all, many thanks for your work on trainer.hyperparameter_search(), which will help a lot of people!
However, I think that your notebooks (Hyperparameter Search with Transformers and Ray Tune & text_classification.ipynb) need updates. Indeed, I tried to run them with Transformers + Ray [tune], but they failed with the same problem mentioned by @sonam (ModuleNotFoundError: No module named 'datasets_modules').
To show the problem, for example with the code from Hyperparameter Search with Transformers and Ray Tune, I published a test notebook in Colab.
Could you tell us how to make them run?
Many thanks in advance.
Hi! Thank you for the detailed explanation. One thing I am not sure about is how to get my_objective to work. How would I define it if I want my objective to be F1, or maybe accuracy?
I have tried
def my_objective(metrics):
    return metrics['eval_macro_f1']
but it doesn't work. What is metrics?
hey @theudster, i suggest looking at the optuna docs to get an idea for how the objective functions are defined. here's a pytorch example that optimises for accuracy: optuna/pytorch_simple.py at master · optuna/optuna · GitHub
you should be able to adapt this to compute the F1 score (or whatever metric you want to optimise for)
Hello, I was not able to use optuna's 'gc_after_trial' option in hyperparameter_search. Without it I always get CUDA out of memory. Is there a way?
I found out that when defining my metric, I need to add eval_ before it. So the metric I called 'f1' had to be called 'eval_f1'.
Hi @sonam, have you managed to get it to work? I keep getting the same error, so any help would be much appreciated.
Thanks!
How do you pass parameters to the model_init function? I have two parameters that need to be set in this function, so I have defined it like this:
def get_model(params):
    db_config = db_config_base
    db_config.update({'alpha': params['alpha_val'], 'dropout': params['dropout_val']})
    return DistilBERTForMultipleSequenceClassification.from_pretrained(db_config, num_labels1=2, num_labels2=8)
I thought maybe Ray Tune would pass the sampled parameters for the specific trial to this function, but instead I get this error:
TypeError: 'NoneType' object is not subscriptable
i.e. 'params' is None.
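A guess at what is going on, in case it helps someone: the Trainer seems to call model_init once without any trial when it is first constructed, so params is None on that first call and only holds the sampled values during the actual search (with the Ray backend the trial passed in is a plain dict). A defensive sketch along those lines, keeping the names from the snippet above:

def get_model(params):
    db_config = db_config_base
    # params is None when the Trainer builds its initial model outside of a trial,
    # so only apply the sampled values when they are actually provided.
    if params is not None:
        db_config.update({'alpha': params['alpha_val'], 'dropout': params['dropout_val']})
    return DistilBERTForMultipleSequenceClassification.from_pretrained(
        db_config, num_labels1=2, num_labels2=8
    )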