Trainer.hyperparameter_search(): Trials did not complete. How to optimize hyperparameters with Ray Tune?

Hello! :grinning: :raising_hand_woman:
I fine-tuned a DeBERTa model (microsoft-deberta-v3-base) and saved a checkpoint from it for inference.

I would like to do hyperparameter optimization with Ray Tune. My strategy was to load the model checkpoint and run hyperparameter_search on it, but when I call it, I get this error:

TuneError                                 Traceback (most recent call last)
<ipython-input-35-8e89fe27cbea> in <module>
     23   )
     24 
---> 25 best_run = trainer.hyperparameter_search(
     26   direction="maximize",
     27   n_trials=2

2 frames
/usr/local/lib/python3.8/dist-packages/transformers/trainer.py in hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
   2414             HPSearchBackend.WANDB: run_hp_search_wandb,
   2415         }
-> 2416         best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
   2417 
   2418         self.hp_search_backend = None

/usr/local/lib/python3.8/dist-packages/transformers/integrations.py in run_hp_search_ray(trainer, n_trials, direction, **kwargs)
    336         dynamic_modules_import_trainable.__mixins__ = trainable.__mixins__
    337 
--> 338     analysis = ray.tune.run(
    339         dynamic_modules_import_trainable,
    340         config=trainer.hp_space(None),

/usr/local/lib/python3.8/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, chdir_to_trial_dir, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, max_concurrent_trials, _experiment_checkpoint_dir, _remote, _remote_string_queue)
    754     if incomplete_trials:
    755         if raise_on_failed_trial and not state["signal"]:
--> 756             raise TuneError("Trials did not complete", incomplete_trials)
    757         else:
    758             logger.error("Trials did not complete: %s", incomplete_trials)

TuneError: ('Trials did not complete', [_objective_84707_00000, _objective_84707_00001])
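The TuneError above only names the trials that failed, not why they failed. A minimal sketch for digging out the underlying exception, assuming that Ray wrote its results to the default `~/ray_results` directory and that each failed trial left an `error.txt` file in its own folder (both are assumptions about the Ray version in use; `collect_trial_errors` is a hypothetical helper name):

```python
from pathlib import Path


def collect_trial_errors(results_dir):
    """Return {trial folder name: error text} for every error.txt under results_dir.

    Assumption: failed Ray Tune trials leave an error.txt file in their trial folder.
    """
    root = Path(results_dir).expanduser()
    errors = {}
    if not root.is_dir():
        # Nothing to scan, e.g. a custom local_dir was used instead.
        return errors
    for error_file in sorted(root.rglob("error.txt")):
        # The parent folder is named after the trial, e.g. _objective_84707_00000...
        errors[error_file.parent.name] = error_file.read_text(errors="replace")
    return errors


# Hypothetical usage (the path is the assumed default, adjust if local_dir was set):
# for trial, text in collect_trial_errors("~/ray_results").items():
#     print(f"=== {trial} ===")
#     print(text)
```

The text in those files should be the real traceback from inside the trial, which is usually more informative than the summary error raised by `ray.tune.run`.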

I tried to build on these two sources (see References) to use Ray as the hyperparameter optimizer, but I don’t know how to proceed. To use the Ray optimizer, do I need the config function?
I used the first example in the Hugging Face docs as a base and it worked fine with the GLUE dataset, but when I tried to replicate it with my own model and another dataset, I got this same error.
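On the config function: the traceback shows `hp_space` as an optional argument of `hyperparameter_search`, and the Ray integration calls it as `trainer.hp_space(None)`, so a custom search space looks optional rather than required. If one is wanted, a minimal sketch could look like the following. The function name `ray_hp_space` and the specific ranges are illustrative assumptions, not values from the docs; the keys are assumed to match `TrainingArguments` field names.

```python
def ray_hp_space(trial):
    """Search space for the Ray backend; `trial` is unused (it is called with None)."""
    # Imported lazily so this module can be loaded even where Ray is not installed.
    from ray import tune

    return {
        # Illustrative ranges only; tune them for the task at hand.
        "learning_rate": tune.loguniform(1e-6, 1e-4),
        "num_train_epochs": tune.choice([1, 2, 3]),
        "per_device_train_batch_size": tune.choice([8, 16, 32]),
    }


# Hypothetical usage with the trainer defined in this post:
# best_run = trainer.hyperparameter_search(
#     hp_space=ray_hp_space, backend="ray", direction="maximize", n_trials=2
# )
```

Passing `backend="ray"` explicitly avoids relying on whichever backend happens to be installed.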

This is my code:

model_checkpoint = '/content/microsoft-deberta-v3-base_dataset_size-200_epochs-2_batch_size-32'

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)

args = TrainingArguments(
    output_dir="/content/ZeroTraining/",
    num_train_epochs=config["num_epochs"],
    per_device_train_batch_size=per_device_train_batch_size,
    seed=42,
    evaluation_strategy="steps",
    eval_steps=100,
    disable_tqdm=True,
)

trainer = Trainer(
    args=args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    model_init=model_init,
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    direction="maximize",
    n_trials=2,
)

Can anybody help me?

References

During my attempts, some questions came up about the implementation: Can I optimize the hyperparameters as part of training? Can I save the best parameters together with a checkpoint and load the best model? How do I access a model by its id to load it?
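For the question about keeping the best parameters with a checkpoint, one possible approach is to copy the winning values from `best_run.hyperparameters` onto the training arguments, retrain once, and store the values as JSON next to the saved model. This is a sketch under assumptions: the helper names and the `best_hyperparameters.json` filename are hypothetical, and it assumes the installed `transformers` version allows setting attributes on `TrainingArguments`.

```python
import json
from pathlib import Path


def apply_hyperparameters(args, hyperparameters):
    """Copy each tuned value onto the arguments object, rejecting unknown names."""
    for name, value in hyperparameters.items():
        if not hasattr(args, name):
            raise AttributeError(f"unknown training argument: {name}")
        setattr(args, name, value)
    return args


def save_hyperparameters(hyperparameters, output_dir, filename="best_hyperparameters.json"):
    """Write the tuned values as JSON inside output_dir and return the file path."""
    path = Path(output_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str stores any value json cannot encode natively as a string.
    path.write_text(json.dumps(hyperparameters, indent=2, default=str))
    return path


# Hypothetical usage with the trainer from this post:
# apply_hyperparameters(trainer.args, best_run.hyperparameters)
# trainer.train()
# trainer.save_model("/content/best_model")
# save_hyperparameters(best_run.hyperparameters, "/content/best_model")
```

The saved folder could then be loaded with `from_pretrained` like any other checkpoint, with the JSON file recording which settings produced it.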
