Using hyperparameter-search in Trainer

Hi @pierreguillou, were you able to solve the error ModuleNotFoundError: No module named 'datasets_modules'? I'm getting the same error, followed by another: TuneError("Trials did not complete", incomplete_trials).

Hi @sonam / @brgsk
Were you able to resolve this issue?
I am getting the same error. Kindly help.

Thanks!

Hi @pierreguillou & @kaankork

Were you able to resolve this issue? I am getting the same error.
Kindly help.

Thanks!

Hi @Devanshi, are you on the latest Transformers version? I think this PR fixed this issue.

Hi @brgsk - thanks for that!
I have upgraded to the latest version and I am not getting that error anymore.

Thanks for your help :slight_smile:

I am getting very strange behavior when running trainer.hyperparameter_search with PyTorch DistributedDataParallel and batch_size as a hyperparameter. I am using two GPU nodes.
It sends different trials to the two GPUs, with batch_size=4 on one and batch_size=8 on the other.

It is supposed to run one trial in distributed fashion across both nodes. Please help.

training_args = TrainingArguments(
    output_dir=self.train_out_dir,                # output directory
    logging_dir=self.train_log_dir,               # directory for storing logs
    num_train_epochs=self.train_param_nb_epochs,  # total number of training epochs
    per_device_train_batch_size=self.train_param_per_device_train_batch_size,  # batch size per device during training
    per_device_eval_batch_size=self.train_param_per_device_eval_batch_size,    # batch size for evaluation
    warmup_steps=self.train_param_warmup_steps,   # number of warmup steps for learning rate scheduler
    weight_decay=self.train_param_weight_decay,   # strength of weight decay
    learning_rate=self.train_param_learning_rate, # args.learning_rate - default is 5e-5, our notebook had 2e-5
    adam_epsilon=self.train_param_adam_epsilon,
)

def model_init():
    print("-----------------")
    print(model_full_path)
    print(self.nb_labels)
    print("-----------------")

    return AutoModelForSequenceClassification.from_pretrained(
        model_full_path,
        num_labels=self.nb_labels,    # the number of output labels -- 2 for binary classification
        output_attentions=False,      # whether the model returns attention weights
        output_hidden_states=False,   # whether the model returns all hidden states
    )

from ray.tune.examples.pbt_transformers import utils

trainer = Trainer(
    model_init=model_init,           # the instantiated 🤗 Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_dataset,     # training dataset
    eval_dataset=val_dataset,        # evaluation dataset
    compute_metrics=utils.build_compute_metrics_fn('rte'),
)

def hp_space(self, trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", self.tune_param_learning_rate[0], self.tune_param_learning_rate[1]),
        "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 4, 8, step=4),
    }

best_run = trainer.hyperparameter_search(
    direction="maximize",
    hp_space=self.hp_space,
    backend=self.backend,
    n_trials=self.tune_param_n_trials,
)

I'm not sure you can combine HP search with distributed training.

Haven't dug deep, but maybe this helps? It does require you to launch Ray on the cluster first, though.
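
Roughly, once Ray is running on the cluster, something along these lines should let the Trainer use it. This is only a sketch I have not verified end to end; the n_trials and resources_per_trial values are placeholders, and extra keyword arguments are forwarded to Ray Tune:

import ray

# Connect to the already-running Ray cluster instead of starting a local one.
ray.init(address="auto")

# Build training_args, model_init, the datasets and the Trainer as usual, then:
best_run = trainer.hyperparameter_search(
    backend="ray",
    n_trials=4,
    resources_per_trial={"cpu": 4, "gpu": 1},  # per-trial share of the cluster's resources
)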

Hi there,

Would anyone be able to help me understand why, after updating my training args following a hyperparameter_search, num_epochs = 1 in the trainer.train() stage despite trainer.args.num_train_epochs = 5?

I currently run hyperparameter_search; my_model is a class that holds all the class variables for now:

import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import PopulationBasedTraining
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir=f'{my_model.save_directory}/results',     # output directory
    num_train_epochs=my_model.number_of_epochs,          # total number of training epochs
    per_device_train_batch_size=my_model.batch_size,     # batch size per device during training
    per_device_eval_batch_size=my_model.batch_size,      # batch size for evaluation
    warmup_steps=my_model.warmup_steps,                  # number of warmup steps for learning rate scheduler
    weight_decay=my_model.weight_decay,                  # strength of weight decay
    logging_dir=f'{my_model.save_directory}/logs',       # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=my_model.logging_steps,                # log & save weights each logging_steps
    evaluation_strategy="steps",                         # evaluate each `logging_steps`
)

trainer = Trainer(
    model_init=my_model.get_model,             # the instantiated Transformers model to be trained
    model=my_model.model,
    args=training_args,                        # training arguments, defined above
    train_dataset=my_model.training_set,       # training dataset
    eval_dataset=my_model.eval_set,            # evaluation dataset
    compute_metrics=my_model.compute_metrics,  # the callback that computes metrics of interest
)

tune_config = {
    "per_device_train_batch_size": 15,
    "per_device_eval_batch_size": 15,
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
    "max_steps": 100,
}

scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_f1",
    mode="max",
    perturbation_interval=1,
    hyperparam_mutations={
        "weight_decay": tune.uniform(0.0, 0.3),
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "per_device_train_batch_size": [8, 16],
    },
)

reporter = CLIReporter(
    parameter_columns={
        "weight_decay": "w_decay",
        "learning_rate": "lr",
        "per_device_train_batch_size": "train_bs/gpu",
        "num_train_epochs": "num_epochs",
    },
    metric_columns=[
        "eval_accuracy", "eval_precision", "eval_recall", "eval_f1",
    ],
)

best_run = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    backend="ray",
    n_trials=2,
    resources_per_trial={
        "cpu": 1,
        "gpu": 1,
    },
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    stop={"training_iteration": 1},
    progress_reporter=reporter,
    local_dir="~/ray_results/",
    name="tune_transformer_pbt",
    log_to_file=True,
)

I then run:

for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

For what it's worth:

hyperparameters={'per_device_train_batch_size': 15, 'per_device_eval_batch_size': 15, 'num_train_epochs': 5,
                 'max_steps': 100, 'weight_decay': 0.17959754525911098, 'learning_rate': 1.624074561769746e-05}

And when I look at trainer.args.num_train_epochs it does indeed equal 5.
But for some reason, when I run trainer.train() I get the following (and it does indeed only run for one epoch):

***** Running training *****
  Num examples = 8220
  Num Epochs = 1
  Instantaneous batch size per device = 15
  Total train batch size (w. parallel, distributed & accumulation) = 15
  Gradient Accumulation steps = 1
  Total optimization steps = 100

On another note, after training, when I run trainer.evaluate() I get an error roughly saying 'NoneType' object has no attribute 'log_metric'. Would anyone have any insight into that as well, please?

Cheers.

Can you please confirm, @sgugger? I am able to perform hyperparameter search for all other params, but not with epoch and batch_size. I assume it's because they affect the number of instances that run on each machine.

Ray does not run with distributed data parallel.

I am running similar code, and it works for me. One thing: do not set the number of epochs as a hyperparameter to tune, because it always goes in one direction.

As stated in the Trainer doc, max_steps overrides the value of num_train_epochs, so it's logical that the value picked is not used.
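
To make that concrete, here is a rough sketch based on the tune_config above (max_steps defaults to -1, which means the number of steps is derived from num_train_epochs):

# Option 1: leave max_steps out of the search space entirely
tune_config = {
    "per_device_train_batch_size": 15,
    "per_device_eval_batch_size": 15,
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
}

# Option 2: reset max_steps to its default before the final training run,
# so the num_train_epochs found by the search is actually honored
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)
trainer.args.max_steps = -1
trainer.train()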

Understood, I now see the problem. Thank you @sgugger and @laveena very much for the quick reply.

Would you have any insight as to why I can't run trainer.evaluate() now, after training with the hyperparameters found by the hyperparameter_search?

for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)
training_results = trainer.train()

#I have tried both of the following commented lines
# my_model.model.eval() 
# trainer.model.eval()
eval_results = trainer.evaluate()

Here is the output of the error that I receive. I just can't figure out which of the changes I've made by training with hyperparameter_search has caused trainer.evaluate() to behave differently for me now.

      1 trainer.model.eval()
----> 2 eval_results = trainer.evaluate()

~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   2053         )
   2054 
-> 2055         self.log(output.metrics)
   2056 
   2057         if DebugOption.TPU_METRICS_DEBUG in self.args.debug:

~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer.py in log(self, logs)
   1718         output = {**logs, **{"step": self.state.global_step}}
   1719         self.state.log_history.append(output)
-> 1720         self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs)
   1721 
   1722     def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:

~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer_callback.py in on_log(self, args, state, control, logs)
    369     def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, logs):
    370         control.should_log = False
--> 371         return self.call_event("on_log", args, state, control, logs=logs)
    372 
    373     def on_prediction_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):

~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer_callback.py in call_event(self, event, args, state, control, **kwargs)
    386                 train_dataloader=self.train_dataloader,
    387                 eval_dataloader=self.eval_dataloader,
--> 388                 **kwargs,
    389             )
    390             # A Callback can skip the return of `control` if it doesn't change it.

~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/integrations.py in on_log(self, args, state, control, logs, **kwargs)
    389             for k, v in logs.items():
    390                 if isinstance(v, (int, float)):
--> 391                     self.tb_writer.add_scalar(k, v, state.global_step)
    392                 else:
    393                     logger.warning(

~/anaconda3/envs/model-training/lib/python3.7/site-packages/tensorboardX/writer.py in add_scalar(self, tag, scalar_value, global_step, walltime, display_name, summary_description)
    451         self._get_file_writer().add_summary(
    452             scalar(tag, scalar_value, display_name, summary_description), global_step, walltime)
--> 453         self.comet_logger.log_metric(tag, display_name, scalar_value, global_step)
    454 
    455     def add_scalars(

AttributeError: 'NoneType' object has no attribute 'log_metric'

This looks like some problem in TensorBoard, from the stack trace.

I wondered that too. But I'm not actually running TensorBoard, hence the confusion. If I train without Ray Tune and hyperparameter search, there's no issue.

I would've thought that once I ran trainer.train() successfully, it should be OK.

The Trainer uses TensorBoard by default if it's installed. You have to pass report_to=[] in your training arguments to explicitly disable that.
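
For example, adapting the TrainingArguments from your snippet above (only the report_to line is new):

training_args = TrainingArguments(
    output_dir=f'{my_model.save_directory}/results',
    num_train_epochs=my_model.number_of_epochs,
    # ... the other arguments as before ...
    report_to=[],  # disables the TensorBoard (and any other) logging integration
)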

You are correct. Thank you. I don't fully understand why that field is being populated now by a different default value, but changing that has solved the issue.

Thank you for such quick responses.

I'd like to do a hyperparameter search for a GPT model. Can someone advise what I should use to create compute_metrics() for trainer.hyperparameter_search()?

Hey @dunalduck0, one usually just tracks the loss or perplexity for GPT-like models. You can compute the losses by adapting the evaluation code in this example :slight_smile:
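
If it helps, here is a minimal sketch, assuming your Trainer is already set up for causal language modeling with an eval_dataset (with no compute_metrics, the search objective defaults to the evaluation loss, so you would minimize it):

import math

# The Trainer reports eval_loss for a causal LM; perplexity is its exponential.
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")

# For the search itself, you can simply minimize the loss:
best_run = trainer.hyperparameter_search(direction="minimize", n_trials=10)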