Hi @pierreguillou, were you able to solve the error ModuleNotFoundError: No module named 'datasets_modules'?
I'm getting the same error, followed by another: TuneError("Trials did not complete", incomplete_trials).
Hi @sonam / @brgsk
Were you able to resolve this issue?
I am getting the same error. Kindly help.
Thanks!
Hi @pierreguillou & @kaankork
Were you able to resolve this issue? I am getting the same error.
Kindly help.
Thanks!
Hi @brgsk - thanks for that!
I have upgraded to the latest version and I am not getting that error anymore.
Thanks for your help!
I am getting very strange behavior when running trainer.hyperparameter_search with PyTorch Distributed Data Parallel and batch_size as a hyperparameter. I am using two GPU nodes.
It sends different trials to the two GPUs, one with batch_size=4 and the other with batch_size=8.
It is supposed to run one trial in distributed fashion across both nodes. Please help.
training_args = TrainingArguments(
    output_dir=self.train_out_dir,                                              # output directory
    logging_dir=self.train_log_dir,                                             # directory for storing logs
    num_train_epochs=self.train_param_nb_epochs,                                # total number of training epochs
    per_device_train_batch_size=self.train_param_per_device_train_batch_size,   # batch size per device during training
    per_device_eval_batch_size=self.train_param_per_device_eval_batch_size,     # batch size for evaluation
    warmup_steps=self.train_param_warmup_steps,                                 # number of warmup steps for learning rate scheduler
    weight_decay=self.train_param_weight_decay,                                 # strength of weight decay
    learning_rate=self.train_param_learning_rate,                               # args.learning_rate - default is 5e-5, our notebook had 2e-5
    adam_epsilon=self.train_param_adam_epsilon,
)
def model_init():
    print("-----------------")
    print(model_full_path)
    print(self.nb_labels)
    print("-----------------")
    return AutoModelForSequenceClassification.from_pretrained(
        model_full_path,
        num_labels=self.nb_labels,      # the number of output labels -- 2 for binary classification
        output_attentions=False,        # whether the model returns attention weights
        output_hidden_states=False,     # whether the model returns all hidden states
    )
from ray.tune.examples.pbt_transformers import utils
trainer = Trainer(
    model_init=model_init,          # the instantiated 🤗 Transformers model to be trained
    args=training_args,             # training arguments, defined above
    train_dataset=train_dataset,    # training dataset
    eval_dataset=val_dataset,       # evaluation dataset
    compute_metrics=utils.build_compute_metrics_fn('rte'),
)
def hp_space(self, trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", self.tune_param_learning_rate[0], self.tune_param_learning_rate[1]),
        "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 4, 8, step=4),
    }
best_run = trainer.hyperparameter_search(
    direction="maximize",
    hp_space=self.hp_space,
    backend=self.backend,
    n_trials=self.tune_param_n_trials,
)
I'm not sure you can combine HP search with distributed training.
Haven't dug deep, but maybe this helps? It does require you to launch ray on the cluster first, though.
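If it helps, here is a rough sketch (my own, untested on the setup above) of pointing the Trainer's Ray backend at a cluster that has already been started with ray start on each node; the trial count and per-trial resources are placeholders:
import ray

# Assumes `ray start --head` was already run on the head node and
# `ray start --address=<head-ip>:6379` on the workers; address="auto"
# attaches to that running cluster instead of starting a local one.
ray.init(address="auto")

best_run = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=8,                                  # placeholder trial count
    resources_per_trial={"cpu": 4, "gpu": 1},    # keep each trial on a single GPU
)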
Hi there,
Would anyone be able to help me understand why, after updating my training args following a hyperparameter_search, num_epochs = 1 in the trainer.train() stage despite trainer.args.num_train_epochs = 5?
I currently run hyperparameter_search like this; my_model is a class that holds all the class variables for now:
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import PopulationBasedTraining
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir=f'{my_model.save_directory}/results',   # output directory
    num_train_epochs=my_model.number_of_epochs,        # total number of training epochs
    per_device_train_batch_size=my_model.batch_size,   # batch size per device during training
    per_device_eval_batch_size=my_model.batch_size,    # batch size for evaluation
    warmup_steps=my_model.warmup_steps,                # number of warmup steps for learning rate scheduler
    weight_decay=my_model.weight_decay,                # strength of weight decay
    logging_dir=f'{my_model.save_directory}/logs',     # directory for storing logs
    load_best_model_at_end=True,                       # load the best model when finished training (default metric is loss)
                                                       # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=my_model.logging_steps,              # log & save weights each logging_steps
    evaluation_strategy="steps",                       # evaluate each `logging_steps`
)
trainer = Trainer(
    model_init=my_model.get_model,              # the instantiated Transformers model to be trained
    model=my_model.model,
    args=training_args,                         # training arguments, defined above
    train_dataset=my_model.training_set,        # training dataset
    eval_dataset=my_model.eval_set,             # evaluation dataset
    compute_metrics=my_model.compute_metrics,   # the callback that computes metrics of interest
)
tune_config = {
    "per_device_train_batch_size": 15,
    "per_device_eval_batch_size": 15,
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
    "max_steps": 100,
}
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_f1",
    mode="max",
    perturbation_interval=1,
    hyperparam_mutations={
        "weight_decay": tune.uniform(0.0, 0.3),
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "per_device_train_batch_size": [8, 16],
    },
)
reporter = CLIReporter(
    parameter_columns={
        "weight_decay": "w_decay",
        "learning_rate": "lr",
        "per_device_train_batch_size": "train_bs/gpu",
        "num_train_epochs": "num_epochs",
    },
    metric_columns=[
        "eval_accuracy", "eval_precision", "eval_recall", "eval_f1",
    ],
)
best_run = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    backend="ray",
    n_trials=2,
    resources_per_trial={
        "cpu": 1,
        "gpu": 1,
    },
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    stop={"training_iteration": 1},
    progress_reporter=reporter,
    local_dir="~/ray_results/",
    name="tune_transformer_pbt",
    log_to_file=True,
)
I then run:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
For what it's worth:
hyperparameters={'per_device_train_batch_size': 15, 'per_device_eval_batch_size': 15, 'num_train_epochs': 5,
'max_steps': 100, 'weight_decay': 0.17959754525911098, 'learning_rate': 1.624074561769746e-05}
And when I look at trainer.args.num_train_epochs
it does indeed equal 5.
But for some reason, when I run trainer.train() I get the following (and it does only run for one epoch):
***** Running training *****
Num examples = 8220
Num Epochs = 1
Instantaneous batch size per device = 15
Total train batch size (w. parallel, distributed & accumulation) = 15
Gradient Accumulation steps = 1
Total optimization steps = 100
On another note, after training, when I run trainer.evaluate() I get an error roughly saying that a NoneType object doesn't have the attribute 'log_metric'. Would anyone have any insight into that as well, please?
Cheers.
Can you please confirm, @sgugger? I am able to perform hyperparameter search for all other params, but not with epochs and batch_size. I assume this is because they affect the number of instances that will run on each machine.
Ray does not run with distributed data parallel.
I am running similar code, and it works for me. One thing: you should not set the number of epochs as a tuned hyperparameter, because it always goes in one direction.
As stated in the Trainer doc, max_steps overrides the value of num_train_epochs, so it's logical that the value picked is not used.
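A minimal illustration (assuming the tune_config above) is to drop max_steps from the search space so the tuned num_train_epochs actually drives training; max_steps then stays at its default of -1, which disables the override:
tune_config = {
    "per_device_train_batch_size": 15,
    "per_device_eval_batch_size": 15,
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
    # no "max_steps" here: trainer.args.max_steps keeps its default of -1,
    # so the tuned num_train_epochs determines the length of each trial
}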
Understood, I now see the problem. Thank you @sgugger and @laveena very much for the quick reply.
Would you have any insight as to why I can't run trainer.evaluate() now, following training with the hyperparameters from the hyperparameter_search?
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

training_results = trainer.train()

# I have tried both of the following commented lines
# my_model.model.eval()
# trainer.model.eval()
eval_results = trainer.evaluate()
Here is the output of the error that I receive. I just can't figure out which of the changes I've made by training with hyperparameter_search has caused trainer.evaluate() to operate differently for me now.
1 trainer.model.eval()
----> 2 eval_results = trainer.evaluate()
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
2053 )
2054
-> 2055 self.log(output.metrics)
2056
2057 if DebugOption.TPU_METRICS_DEBUG in self.args.debug:
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer.py in log(self, logs)
1718 output = {**logs, **{"step": self.state.global_step}}
1719 self.state.log_history.append(output)
-> 1720 self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs)
1721
1722 def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer_callback.py in on_log(self, args, state, control, logs)
369 def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, logs):
370 control.should_log = False
--> 371 return self.call_event("on_log", args, state, control, logs=logs)
372
373 def on_prediction_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/trainer_callback.py in call_event(self, event, args, state, control, **kwargs)
386 train_dataloader=self.train_dataloader,
387 eval_dataloader=self.eval_dataloader,
--> 388 **kwargs,
389 )
390 # A Callback can skip the return of `control` if it doesn't change it.
~/anaconda3/envs/model-training/lib/python3.7/site-packages/transformers/integrations.py in on_log(self, args, state, control, logs, **kwargs)
389 for k, v in logs.items():
390 if isinstance(v, (int, float)):
--> 391 self.tb_writer.add_scalar(k, v, state.global_step)
392 else:
393 logger.warning(
~/anaconda3/envs/model-training/lib/python3.7/site-packages/tensorboardX/writer.py in add_scalar(self, tag, scalar_value, global_step, walltime, display_name, summary_description)
451 self._get_file_writer().add_summary(
452 scalar(tag, scalar_value, display_name, summary_description), global_step, walltime)
--> 453 self.comet_logger.log_metric(tag, display_name, scalar_value, global_step)
454
455 def add_scalars(
AttributeError: 'NoneType' object has no attribute 'log_metric'
This looks like some problem in TensorBoard, from the stack trace.
I wondered that too. But I'm not actually running TensorBoard - hence the confusion. If I train without tune and hyperparameter search there's no issue.
I would've thought that once I ran trainer.train() successfully it should be ok.
The Trainer uses TensorBoard by default if it's installed. You have to pass report_to=[] in your training arguments to explicitly disable that.
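For example, a minimal sketch adapting the TrainingArguments from earlier in the thread (only the relevant argument shown):
training_args = TrainingArguments(
    output_dir=f'{my_model.save_directory}/results',
    report_to=[],    # disable the TensorBoard/W&B/Comet logging integrations
    # ... the rest of the arguments as before
)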
You are correct. Thank you. I don't fully understand why that field is now being populated with a different default value, but changing that has solved the issue.
Thank you for such quick responses.
I'd like to do hyperparameter search for a GPT model. Can someone advise what I should use to create compute_metrics() for trainer.hyperparameter_search()?
Hey @dunalduck0, one usually just tracks the loss or perplexity for GPT-like models. You can compute the losses by adapting the evaluation code in this example.
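For instance, since Trainer.evaluate() already reports eval_loss for causal language models, one option (a sketch, not taken from that example) is to skip compute_metrics entirely and pass a compute_objective that turns the loss into perplexity:
import math

def compute_objective(metrics):
    # perplexity = exp(eval_loss); lower is better, so search with direction="minimize"
    return math.exp(metrics["eval_loss"])

best_run = trainer.hyperparameter_search(
    direction="minimize",
    compute_objective=compute_objective,
    n_trials=10,    # placeholder
)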