I am using the Trainer to train a BART model on 4 GPUs (one node). The command line goes like this:

```
python -m torch.distributed.launch my_file.py
```
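(As far as I understand, torch.distributed.launch only spawns `--nproc_per_node` processes, which defaults to 1, so I believe the full invocation for 4 GPUs should look like the sketch below; I may be missing other flags:)

```
python -m torch.distributed.launch --nproc_per_node=4 my_file.py
```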
and the code goes like this:
```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# model, tokenizer, dataset, data_collator, compute_metrics and batch_size
# are defined earlier in my_file.py
args = Seq2SeqTrainingArguments(
    "bart-large-copy",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.001,
    # lr_scheduler_type="cosine",
    adam_beta1=0.9,
    adam_beta2=0.98,
    # warmup_steps=4000,
    label_smoothing_factor=0.1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=50,
    predict_with_generate=True,
    max_grad_norm=0.0,
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",
    greater_is_better=True,
    fp16=False,
    gradient_accumulation_steps=2,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["dev"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
```
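Since `metric_for_best_model="sacrebleu"` has to match a key returned by `compute_metrics`, mine returns a dict with a `"sacrebleu"` key, roughly along these lines (a simplified sketch; `tokenizer` is defined earlier in my script and the exact post-processing may differ):

```python
import numpy as np
from datasets import load_metric

metric = load_metric("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # labels use -100 as the ignore index; swap it for the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = metric.compute(
        predictions=[p.strip() for p in decoded_preds],
        references=[[l.strip()] for l in decoded_labels],
    )
    # the key must match metric_for_best_model (the Trainer looks it up as "eval_sacrebleu")
    return {"sacrebleu": result["score"]}
```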
I got an error. Here is my environment info:
transformers: 4.16.2
torch: 1.9.0+rocm4.0.1
GPUs: 4 × AMD Vega 20 (66a1)
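One thing I am not sure about is whether my script needs to pick up the `--local_rank` argument that torch.distributed.launch passes to each process, since `TrainingArguments` defaults `local_rank` to -1 (no DDP). A sketch of what I mean (the argparse code is just illustrative, not from my script):

```python
import argparse

# torch.distributed.launch appends --local_rank=<n> to each process by default
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
cli_args, _ = parser.parse_known_args()

# then forward it into the training arguments:
# args = Seq2SeqTrainingArguments(..., local_rank=cli_args.local_rank)
```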
I don't know why it doesn't work. I would appreciate any suggestions.
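In case it is relevant, here is a minimal snippet (not from my training script) that I can run to confirm this ROCm build of torch sees all four GPUs:

```python
import torch

# ROCm builds expose the GPUs through the torch.cuda namespace
print("available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
```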