Transformers and Hyperparameter search using Optuna

Hello all,

I am currently trying to perform hyperparameter search using trainer.hyperparameter_search(), with Optuna as the backend. However, I have some questions about the trials that are performed:

  1. Is there a way to retrieve the whole study object (with all trials and all metrics) and not only the best trial?

  2. Is pruning of trials activated? I can see the relevant code in the source (an intermediate metric is reported and then trial.should_prune() is checked), but I am not sure it is actually activated by hyperparameter_search(). Ideally, I would like a report of which trials were pruned and which ones finished.

Thank you in advance,
Petrina

  1. Not to my knowledge. But you can retrieve a log of all the Optuna trials (Hugging Face calls them “runs”) in two different ways.

First way
If you check the Trainer object after you have called its hyperparameter_search() method, you will find the log in Trainer.state.log_history; it is a list.
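
Something like this (just a sketch; trainer is whatever Trainer instance you already called hyperparameter_search() on) prints every logged entry:

    # Sketch: trainer is the Trainer used for hyperparameter_search().
    # Each entry is a plain dict with the metrics logged at one event.
    for entry in trainer.state.log_history:
        print(entry)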

Second way
Also, the logs are saved on disk as JSON files called trainer_state.json every time a checkpoint is saved during the optimization process. You can set the directory where checkpoints are stored with the output_dir parameter of TrainingArguments(); I recommend you also set its logging_strategy and save_strategy to 'epoch', so that you get one checkpoint, and therefore one saved log, at the end of every epoch.
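
As a minimal sketch of those settings (the output_dir value is just a placeholder):

    from transformers import TrainingArguments

    # Sketch: 'my_output_dir' is a placeholder; with both strategies set to 'epoch'
    # you get one log entry and one checkpoint, with its trainer_state.json, per epoch.
    training_args = TrainingArguments(
        output_dir='my_output_dir',
        logging_strategy='epoch',
        save_strategy='epoch')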

BTW, if you find a way to fetch the whole Study object, please do let me know.

In addition to the above, you can get very detailed information about what Optuna is doing by enabling its persistent storage on an SQLite database. It is straightforward; to browse the database I use DBeaver CE. To enable database storage, you can pass the storage and load_if_exists parameters to Trainer.hyperparameter_search(), e.g.:

    from optuna.pruners import NopPruner

    res = trainer.hyperparameter_search(
        hp_space=hp_space,
        n_trials=params.fine_tuning.n_trials,
        direction='maximize',
        compute_objective=compute_objective,
        sampler=optuna_sampler,
        study_name=study_name,
        storage='sqlite:///my_optuna_studies.db',
        load_if_exists=True,
        pruner=NopPruner())

Those parameters are then forwarded by the Trainer to optuna.create_study(); see the Optuna documentation for their usage.
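
For what it's worth, once the study is persisted like that, Optuna itself should be able to read the trials back from the same storage. This is just a sketch I haven't fully verified with the Trainer integration; study_name and the storage string are the same values passed above:

    import optuna

    # Sketch: study_name and the storage string are the values passed to
    # hyperparameter_search() above; load_study() reads the persisted study back.
    study = optuna.load_study(study_name=study_name,
                              storage='sqlite:///my_optuna_studies.db')

    # Each trial record carries its number, state, objective value and sampled parameters.
    for trial in study.trials:
        print(trial.number, trial.state, trial.value, trial.params)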

  2. Yes, pruning is active by default, unless you disable it. If you enable Optuna's database persistence (as per the point above), you can find information about which trials were pruned in the database, in the trials table, state column.

To disable pruning, you can pass pruner=NopPruner() to Trainer.hyperparameter_search(), as I did in the code snippet at point 1. See the Optuna documentation to choose a different pruning strategy instead.
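
As a rough sketch of what I mean, you can also query that SQLite file directly; the trials table and its state column are where this information ends up:

    import sqlite3

    # Sketch: raw query against the same SQLite file; the trials table and its
    # state column record whether each trial completed or was pruned.
    con = sqlite3.connect('my_optuna_studies.db')
    for state, count in con.execute('SELECT state, COUNT(*) FROM trials GROUP BY state'):
        print(state, count)
    con.close()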

There are a few caveats I have stumbled upon; I am still investigating them, but they may interest you.
a) After the hyperparameter search, Trainer.state.best_model_checkpoint doesn't contain the path to the best checkpoint saved, but to the last checkpoint saved. If I am getting this right, it is a Hugging Face bug.
b) Optuna's DB persistence should allow you to interrupt and then resume the hyperparameter search. While I have had that feature work correctly in a toy example using just Optuna, so far I couldn't make it work with Hugging Face: when I try to resume the hyperparameter search, it restarts from the beginning, ignoring the trials from the previous experiment. Perhaps I am doing something wrong here.


Thank you very much for your detailed answer!

For point 1, I have also noticed two things that are not straightforward to me, though I haven't put much effort into them yet. I am just mentioning them in case you have seen something similar:

My code for the hyperparameter search is the following:

    training_args = TrainingArguments(
        output_dir=SAVE_DIR,
        per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH,
        per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
        evaluation_strategy="epoch",
        logging_strategy="epoch")

    trainer = Trainer(
        model=None,
        args=training_args,
        train_dataset=tokenized_dset['train'],
        eval_dataset=tokenized_dset['validation'],
        model_init=model_init
    )

    best_trial = trainer.hyperparameter_search(
        direction="minimize",
        backend="optuna",
        hp_space=optuna_hp_space,
        n_trials=NUM_TRIALS
    )
  1. Even though I have set logging_strategy="epoch", I seem to get logs every 500 steps, so in the directory I have checkpoint-500, checkpoint-1000, etc., which is confusing since each log then seems to cover several epochs.

  2. In the directory there is a JSON file that contains the whole history of epochs (it could be the same as trainer.state.log_history); however, if I have observed correctly, the metrics of the last epoch aren't logged in this file.

For point 2, yes, I ran a study with 10 trials and noticed that some were pruned, so I guess it is working. I will definitely consider the SQLite database, since it seems very handy for keeping all the information about the trials.

Thank you for the caveats too. I will keep them in mind, and if I find a way to get the whole study as an object, I will let you know!

Thanks again,
Petrina

Steps with transformers are not epochs; in my understanding, 1 step = 1 batch.
So if you have a train set with, say, 1000 samples and you train in batches of 64 samples, it will take 16 steps (16 batches) to push the whole train set through training once, i.e. to do 1 epoch. In this example, 1 epoch = 16 steps. If you do the math for your script, you should find that, in your case, 500 steps = 1 epoch.
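
As a quick sanity check of that arithmetic:

    import math

    # Worked example from above: 1000 samples in batches of 64 -> 16 steps per epoch.
    steps_per_epoch = math.ceil(1000 / 64)
    print(steps_per_epoch)  # 16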

I am not sure I understand which .json file you are referring to. What is the complete path?
The log files I have are organized like this:

There is one directory for every run (what Optuna calls a “trial”), and every run directory contains its checkpoints; in my case they are saved every 32 steps because here 1 epoch = 32 steps. In every checkpoint, trainer_state.json contains all the info for that specific run up to and including the given checkpoint.

If you are looking for a single, overall log file with all the runs together, no, I don't have one. I get one trainer_state.json log file for every checkpoint of every run.
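
If you want one combined view anyway, something along these lines should gather them. It's just a sketch, assuming the output_dir layout described above (one subdirectory per run, each holding checkpoint-*/trainer_state.json), and OUTPUT_DIR is a placeholder:

    import json
    from pathlib import Path

    # Sketch: OUTPUT_DIR is the TrainingArguments output_dir (placeholder name).
    OUTPUT_DIR = Path('my_output_dir')

    combined = {}
    for state_file in OUTPUT_DIR.glob('*/checkpoint-*/trainer_state.json'):
        run_name = state_file.parents[1].name  # the run directory
        state = json.loads(state_file.read_text())
        # Keep, per run, the state from the checkpoint with the highest global_step,
        # since its log_history already includes all the earlier entries of that run.
        if state['global_step'] >= combined.get(run_name, {}).get('global_step', -1):
            combined[run_name] = state

    for run_name, state in combined.items():
        print(run_name, len(state['log_history']), 'logged entries')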

OK yes, this clarifies the issue. With my dataset, 1 epoch = 500 steps, and because the default logging interval when logging_strategy="steps" is 500 steps, I thought it was logging every 500 steps regardless. Thank you very much for this!

Yes, you understood correctly, thank you for the clarifications on this too.

Also, regarding trainer.state.log_history, I just want to underline that it contains only the log of the last trial. So, as you stated, I understand there is no obvious way to get the logs for the whole study.

Thank you very much once again for the help!