Hi, everyone.
Thanks for reading this.
I am working through a summarization task to understand fine-tuning with the mT5 model.
My machine has two GPUs: one A5000 and one A4000.
In the "Fine-tuning mT5 with the Trainer API" section, when I run trainer.train(), I get the errors below.
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py:30: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
***** Running training *****
Num examples = 9672
Num Epochs = 8
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 4840
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
[ 606/4840 03:42 < 25:56, 2.72 it/s, Epoch 1/8]
Epoch Training Loss Validation Loss
Saving model checkpoint to mt5-small-finetuned-amazon-en-es/checkpoint-500
Configuration saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/config.json
Model weights saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/pytorch_model.bin
tokenizer config file saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/tokenizer_config.json
Special tokens file saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/special_tokens_map.json
Copy vocab file to mt5-small-finetuned-amazon-en-es/checkpoint-500/spiece.model
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py:30: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
***** Running Evaluation *****
Num examples = 238
Batch size = 16
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-32-3435b262f1ae> in <module>
----> 1 trainer.train()
/usr/local/lib/python3.8/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1346
1347 self.control = self.callback_handler.on_epoch_end(args, self.state, self.control)
-> 1348 self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
1349
1350 if DebugOption.TPU_METRICS_DEBUG in self.args.debug:
/usr/local/lib/python3.8/dist-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
1443 metrics = None
1444 if self.control.should_evaluate:
-> 1445 metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
1446 self._report_to_hp_search(trial, epoch, metrics)
1447
/usr/local/lib/python3.8/dist-packages/transformers/trainer_seq2seq.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix, max_length, num_beams)
73 self._max_length = max_length if max_length is not None else self.args.generation_max_length
74 self._num_beams = num_beams if num_beams is not None else self.args.generation_num_beams
---> 75 return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
76
77 def predict(
...
-> 2183 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2184
2185
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
I also tried running CUDA_VISIBLE_DEVICES=0 python train.py, but the error still happens.
When I run this code on a single-GPU machine, it works fine.
I want to run it on the two-GPU machine.
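For reference, here is a minimal sketch of how I am now trying to pin the run to a single GPU (the script layout and checkpoint name are placeholders, not my full training code); my understanding is that CUDA_VISIBLE_DEVICES only takes effect if it is set before torch is imported.

# single-GPU sketch (illustrative only)
import os

# Hide the second GPU before torch/transformers are imported,
# so the Trainer never wraps the model in DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Sanity check: should print 1 once the mask has taken effect.
print(torch.cuda.device_count())

If torch.cuda.device_count() still reports 2 here, the variable was probably set too late (after CUDA was already initialized).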
In the Accelerate section, the tokenized_datasets.set_format("torch") call causes an error when I run the training code.
The error message is below:
Traceback (most recent call last):
File "summary_train.py", line 211, in <module>
for step, batch in enumerate(train_dataloader):
File "/usr/local/lib/python3.8/dist-packages/accelerate/data_loader.py", line 301, in __iter__
for batch in super().__iter__():
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 570, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py", line 531, in __call__
feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"]
TypeError: unsupported operand type(s) for +: 'Tensor' and 'list'
When I remove the tokenized_datasets.set_format("torch") line, the code runs fine.
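Looking at the traceback, data_collator.py pads "labels" by plain list concatenation (feature["labels"] + remainder), so it seems the labels need to stay Python lists rather than tensors. Below is a sketch of the workaround I am considering, assuming the usual column names from the course notebook (input_ids, attention_mask, labels) and that the raw text columns have already been removed:

# Format only the input columns as tensors and keep "labels" as plain
# Python lists so DataCollatorForSeq2Seq can pad them itself.
tokenized_datasets.set_format(
    type="torch",
    columns=["input_ids", "attention_mask"],
    output_all_columns=True,  # unformatted columns (e.g. "labels") come back as lists
)

That said, since DataCollatorForSeq2Seq already returns PyTorch tensors by default (return_tensors="pt"), simply dropping the set_format call, as I did above, may be the cleanest fix.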