Hi @pierreguillou, were you able to solve the error ModuleNotFoundError: No module named 'datasets_modules'?
I'm getting the same error, followed by another: TuneError("Trials did not complete", incomplete_trials).
Hi @sonam / @brgsk
Were you able to resolve this issue?
I am getting the same error. Kindly help.
Thanks!
Hi @pierreguillou & @kaankork
Were you able to resolve this issue? I am getting the same error.
Kindly help.
Thanks!
Hi @brgsk - thanks for that!
I have upgraded to the latest version and I am not getting that error anymore.
Thanks for your help!
I am getting very strange behavior when running trainer.hyperparameter_search with PyTorch Distributed Data Parallel and batch_size as a hyperparameter. I am using two GPU nodes.
It sends different trials to the two GPUs, one with batch_size=4 and the other with batch_size=8.
It is supposed to run one trial in distributed fashion across both nodes. Please help.
training_args = TrainingArguments(
    output_dir=self.train_out_dir,                                              # output directory
    logging_dir=self.train_log_dir,                                             # directory for storing logs
    num_train_epochs=self.train_param_nb_epochs,                                # total number of training epochs
    per_device_train_batch_size=self.train_param_per_device_train_batch_size,   # batch size per device during training
    per_device_eval_batch_size=self.train_param_per_device_eval_batch_size,     # batch size for evaluation
    warmup_steps=self.train_param_warmup_steps,                                 # number of warmup steps for learning rate scheduler
    weight_decay=self.train_param_weight_decay,                                 # strength of weight decay
    learning_rate=self.train_param_learning_rate,                               # args.learning_rate - default is 5e-5, our notebook had 2e-5
    adam_epsilon=self.train_param_adam_epsilon,
)
def model_init():
    print("-----------------")
    print(model_full_path)
    print(self.nb_labels)
    print("-----------------")
    return AutoModelForSequenceClassification.from_pretrained(
        model_full_path,
        num_labels=self.nb_labels,      # the number of output labels -- 2 for binary classification
        output_attentions=False,        # whether the model returns attention weights
        output_hidden_states=False,     # whether the model returns all hidden states
    )
from ray.tune.examples.pbt_transformers import utils
trainer = Trainer(
    model_init=model_init,          # the instantiated 🤗 Transformers model to be trained
    args=training_args,             # training arguments, defined above
    train_dataset=train_dataset,    # training dataset
    eval_dataset=val_dataset,       # evaluation dataset
    compute_metrics=utils.build_compute_metrics_fn('rte'),
)
def hp_space(self, trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", self.tune_param_learning_rate[0], self.tune_param_learning_rate[1]),
        "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 4, 8, step=4),
    }
best_run = trainer.hyperparameter_search(
    direction="maximize",
    hp_space=self.hp_space,
    backend=self.backend,
    n_trials=self.tune_param_n_trials,
)
I'm not sure you can combine HP search with distributed training.
Haven't dug deep, but maybe this helps? It does require you to launch ray on the cluster first, though.
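If it helps, here is a rough sketch (my own, untested on the setup above) of pointing the Trainer's Ray backend at a cluster that has already been started with ray start on each node; the trial count and per-trial resources are placeholders:
import ray

# Assumes `ray start --head` was already run on the head node and
# `ray start --address=<head-ip>:6379` on the workers; address="auto"
# attaches to that running cluster instead of starting a local one.
ray.init(address="auto")

best_run = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=8,                                  # placeholder trial count
    resources_per_trial={"cpu": 4, "gpu": 1},    # keep each trial on a single GPU
)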
Hi there,
Would anyone be able to help me understand why, after updating my training args following a hyperparameter_search, num_epochs = 1 in the trainer.train() stage despite trainer.args.num_train_epochs = 5?
I currently run hyperparameter_search like this; my_model is a class that holds all the class variables for now:
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import PopulationBasedTraining
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir=f'{my_model.save_directory}/results',   # output directory
    num_train_epochs=my_model.number_of_epochs,        # total number of training epochs
    per_device_train_batch_size=my_model.batch_size,   # batch size per device during training
    per_device_eval_batch_size=my_model.batch_size,    # batch size for evaluation
    warmup_steps=my_model.warmup_steps,                # number of warmup steps for learning rate scheduler
    weight_decay=my_model.weight_decay,                # strength of weight decay
    logging_dir=f'{my_model.save_directory}/logs',     # directory for storing logs
    load_best_model_at_end=True,                       # load the best model when finished training (default metric is loss)
                                                       # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=my_model.logging_steps,              # log & save weights each logging_steps
    evaluation_strategy="steps",                       # evaluate each `logging_steps`
)
trainer = Trainer(
    model_init=my_model.get_model,              # the instantiated Transformers model to be trained
    model=my_model.model,
    args=training_args,                         # training arguments, defined above
    train_dataset=my_model.training_set,        # training dataset
    eval_dataset=my_model.eval_set,             # evaluation dataset
    compute_metrics=my_model.compute_metrics,   # the callback that computes metrics of interest
)
tune_config = {
    "per_device_train_batch_size": 15,
    "per_device_eval_batch_size": 15,
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
    "max_steps": 100,
}
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_f1",
    mode="max",
    perturbation_interval=1,
    hyperparam_mutations={
        "weight_decay": tune.uniform(0.0, 0.3),
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "per_device_train_batch_size": [8, 16],
    },
)
reporter = CLIReporter(
    parameter_columns={
        "weight_decay": "w_decay",
        "learning_rate": "lr",
        "per_device_train_batch_size": "train_bs/gpu",
        "num_train_epochs": "num_epochs",
    },
    metric_columns=[
        "eval_accuracy", "eval_precision", "eval_recall", "eval_f1",
    ],
)
best_run = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    backend="ray",
    n_trials=2,
    resources_per_trial={
        "cpu": 1,
        "gpu": 1,
    },
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    stop={"training_iteration": 1},
    progress_reporter=reporter,
    local_dir="~/ray_results/",
    name="tune_transformer_pbt",
    log_to_file=True,
)
I then run:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
For what it's worth:
hyperparameters={'per_device_train_batch_size': 15, 'per_device_eval_batch_size': 15, 'num_train_epochs': 5,
'max_steps': 100, 'weight_decay': 0.17959754525911098, 'learning_rate': 1.624074561769746e-05}
And when I look at trainer.args.num_train_epochs
it does indeed equal 5.
But for some reason, when I run trainer.train() I get the following (and it does only run for one epoch):
***** Running training *****
Num examples = 8220
Num Epochs = 1
Instantaneous batch size per device = 15
Total train batch size (w. parallel, distributed & accumulation) = 15
Gradient Accumulation steps = 1
Total optimization steps = 100
On another note, after training, when I run trainer.evaluate() I get an error roughly saying that a NoneType object doesn't have the attribute 'log_metric'. Would anyone have any insight into that as well, please?
Cheers.
Can you please confirm, @sgugger? I am able to perform hyperparameter search for all other params, but not with epochs and batch_size. I assume this is because they affect the number of instances that will run on each machine.
Ray does not run with distributed data parallel.
I am running similar code, and it works for me. One thing: you should not set the number of epochs as a tuned hyperparameter, because it always goes in one direction.
As stated in the Trainer doc, max_steps overrides the value of num_train_epochs, so it's logical that the value picked is not used.
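A minimal illustration (assuming the tune_config above) is to drop max_steps from the search space so the tuned num_train_epochs actually drives training; max_steps then stays at its default of -1, which disables the override:
tune_config = {
    "per_device_train_batch_size": 15,
    "per_device_eval_batch_size": 15,
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
    # no "max_steps" here: trainer.args.max_steps keeps its default of -1,
    # so the tuned num_train_epochs determines the length of each trial
}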
Understood, I now see the problem. Thank you @sgugger and @laveena very much for the quick reply.
Would you have any insight as to why I can't run trainer.evaluate() now, following training with the hyperparameters from the hyperparameter_search?
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

training_results = trainer.train()

# I have tried both of the following commented lines
# my_model.model.eval()
# trainer.model.eval()
eval_results = trainer.evaluate()
Here is the output of the error that I receive. I just can't figure out which of the changes I've made by training with hyperparameter_search has caused trainer.evaluate() to operate differently for me now.
1 trainer.model.eval()
----> 2 eval_results = trainer.evaluate()
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
2053 )
2054
-> 2055 self.log(output.metrics)
2056
2057 if DebugOption.TPU_METRICS_DEBUG in self.args.debug:
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer.py in log(self, logs)
1718 output = {**logs, **{"step": self.state.global_step}}
1719 self.state.log_history.append(output)
-> 1720 self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs)
1721
1722 def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer_callback.py in on_log(self, args, state, control, logs)
369 def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, logs):
370 control.should_log = False
--> 371 return self.call_event("on_log", args, state, control, logs=logs)
372
373 def on_prediction_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer_callback.py in call_event(self, event, args, state, control, **kwargs)
386 train_dataloader=self.train_dataloader,
387 eval_dataloader=self.eval_dataloader,
--> 388 **kwargs,
389 )
390 # A Callback can skip the return of `control` if it doesn't change it.
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/integrations.py in on_log(self, args, state, control, logs, **kwargs)
389 for k, v in logs.items():
390 if isinstance(v, (int, float)):
--> 391 self.tb_writer.add_scalar(k, v, state.global_step)
392 else:
393 logger.warning(
~/anaconda3/envs/model-training/lib/python3.7/site-packages/tensorboardX/writer.py in add_scalar(self, tag, scalar_value, global_step, walltime, display_name, summary_description)
451 self._get_file_writer().add_summary(
452 scalar(tag, scalar_value, display_name, summary_description), global_step, walltime)
--> 453 self.comet_logger.log_metric(tag, display_name, scalar_value, global_step)
454
455 def add_scalars(
AttributeError: 'NoneType' object has no attribute 'log_metric'
This looks like some problem in TensorBoard, from the stack trace.
I wondered that too. But I'm not actually running TensorBoard - hence the confusion. If I train without tune and hyperparameter search there's no issue.
I would've thought that once I ran trainer.train() successfully it should be ok.
The Trainer uses TensorBoard by default if it's installed. You have to pass report_to=[] in your training arguments to explicitly disable that.
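For example, a minimal sketch adapting the TrainingArguments from earlier in the thread (only the relevant argument shown):
training_args = TrainingArguments(
    output_dir=f'{my_model.save_directory}/results',
    report_to=[],    # disable the TensorBoard/W&B/Comet logging integrations
    # ... the rest of the arguments as before
)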
You are correct. Thank you. I don't fully understand why that field is now being populated with a different default value, but changing that has solved the issue.
Thank you for such quick responses.
I'd like to do hyperparameter search for a GPT model. Can someone advise what I should use to create compute_metrics() for trainer.hyperparameter_search()?
Hey @dunalduck0, one usually just tracks the loss or perplexity for GPT-like models. You can compute the losses by adapting the evaluation code in this example.
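For instance, since Trainer.evaluate() already reports eval_loss for causal language models, one option (a sketch, not taken from that example) is to skip compute_metrics entirely and pass a compute_objective that turns the loss into perplexity:
import math

def compute_objective(metrics):
    # perplexity = exp(eval_loss); lower is better, so search with direction="minimize"
    return math.exp(metrics["eval_loss"])

best_run = trainer.hyperparameter_search(
    direction="minimize",
    compute_objective=compute_objective,
    n_trials=10,    # placeholder
)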