Hi,
I am trying to finetune a T5-large model on multiple GPUs on a cluster, and I get the following error message:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
I am able to finetune T5-base on the same cluster.
I’d like to ask two questions:
- Is it expected behavior for tensors to end up on different devices? Some tutorials (1, 2) suggest that `lm_head` has to be on the same device as the embedding layer, and I’ve made sure that mapping is correct. Is there anything else I’m missing?
- `infer_auto_device_map()` tries to load the entire model onto one GPU if I specify `max_memory` of more than 1GiB per device, even though the T5-large model is definitely much larger than 1GiB. I wonder if I’m using `infer_auto_device_map()` correctly?
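For reference, here is my own sketch (not accelerate's actual code) of how I understand the `max_memory` strings to be turned into per-device byte budgets; I'm assuming binary units for `GiB`/`MiB` and decimal units for `GB`/`MB`, as described in the accelerate docs:

```python
# Hypothetical helper mirroring my understanding of how size strings such as
# "1GiB" or "2GB" become per-device byte budgets (assumption: binary units
# for GiB/MiB/KiB, decimal units for GB/MB/KB).
UNITS = {"GIB": 2**30, "MIB": 2**20, "KIB": 2**10,
         "GB": 10**9, "MB": 10**6, "KB": 10**3}

def size_to_bytes(size: str) -> int:
    size = size.strip().upper()
    for unit, factor in UNITS.items():
        if size.endswith(unit):
            return int(float(size[: -len(unit)]) * factor)
    return int(size)  # a bare number is taken as bytes

max_memory = {i: "1GiB" for i in range(2)}  # 2 GPUs
budgets = {dev: size_to_bytes(spec) for dev, spec in max_memory.items()}
print(budgets)  # {0: 1073741824, 1: 1073741824}
```

So with `"1GiB"` each GPU should only get about 1e9 bytes of weights, which is why I expected the model to be spread across both devices.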
Any feedback or suggestions would be greatly appreciated!
OS Versions of Cluster:
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
Environment Package Versions:
python==3.9.16
torch==1.13.1+cu117
cudatoolkit==11.3.1
cuda version==11.7
torchvision==0.14.1
accelerate==0.17.1
Command:
accelerate launch s2s_hf_transformers.py --model=t5-large
Here are the main function and the `TrainingArguments`:
# main function
import argparse
import torch
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration
from accelerate import init_empty_weights, infer_auto_device_map

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, default='data/data.pkl')
    parser.add_argument('--model', type=str, default='t5-large')
    args = parser.parse_args()

    max_memory = {i: "1GiB" for i in range(2)}  # 2 GPUs
    config = T5Config.from_pretrained(args.model)
    tokenizer = T5Tokenizer.from_pretrained(args.model)
    with init_empty_weights():
        model = T5ForConditionalGeneration(config)
    device_map = infer_auto_device_map(model, no_split_module_classes=["T5DecoderLayer", "T5EncoderLayer"], dtype=torch.float16, max_memory=max_memory)
    device_map['lm_head'] = 0  # keep lm_head on the same device as the embedding layer
    print(device_map)
    model = T5ForConditionalGeneration.from_pretrained(args.model, device_map=device_map)

    if args.model in T5_MODELS:
        finetune_t5(args.model, tokenizer, args.data)
    else:
        raise TypeError(f"ERROR: unrecognized model, {args.model}")
# TrainingArguments
train_args = Seq2SeqTrainingArguments(
    output_dir=f"model/{model_name}",
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=100,
    learning_rate=1e-4,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    num_train_epochs=10,
    # fp16=True,
    predict_with_generate=True,
    metric_for_best_model="exact_match",
    load_best_model_at_end=True,
    save_total_limit=3,
    overwrite_output_dir=True,
    report_to="tensorboard",
)
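To sanity-check the device map I get back (my second question above), I group the module names by assigned device with a small helper of my own (not an accelerate API). On a trimmed version of the map printed in the log below, it confirms that the tied weights `shared`, `encoder.embed_tokens`, and `lm_head` all sit on GPU 0, but also shows that `encoder.block.5` itself is split across the two devices:

```python
from collections import defaultdict

# Self-written sanity check (not an accelerate API): group the module
# names in a device_map by the device they were assigned to.
def modules_per_device(device_map):
    grouped = defaultdict(list)
    for module, device in device_map.items():
        grouped[device].append(module)
    return dict(grouped)

# A trimmed version of the device_map printed by my run:
device_map = {"shared": 0, "encoder.embed_tokens": 0, "lm_head": 0,
              "encoder.block.5.layer.0": 0, "encoder.block.5.layer.1": 1}
grouped = modules_per_device(device_map)
# Tied weights should share a device...
assert {"shared", "encoder.embed_tokens", "lm_head"} <= set(grouped[0])
# ...but note that block 5 is split between cuda:0 and cuda:1.
print(grouped)
```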
Complete error message (tqdm bars are removed):
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `2`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
warnings.warn(
/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
warnings.warn(
{'shared': 0, 'decoder': 0, 'encoder.embed_tokens': 0, 'encoder.block.0': 0, 'encoder.block.1': 0, 'encoder.block.2': 0, 'encoder.block.3': 0, 'encoder.block.4': 0, 'encoder.block.5.layer.0': 0, 'encoder.block.6': 1, 'encoder.block.7': 1, 'encoder.block.8': 1, 'encoder.block.9': 1, 'encoder.block.10': 1, 'encoder.block.11': 1, 'encoder.block.12': 1, 'encoder.block.13': 1, 'encoder.block.14': 1, 'encoder.block.15': 1, 'encoder.block.16': 1, 'encoder.block.17': 1, 'encoder.block.18': 1, 'encoder.block.19': 1, 'encoder.block.20': 1, 'encoder.block.21': 1, 'encoder.block.22': 1, 'encoder.block.23': 1, 'encoder.final_layer_norm': 1, 'encoder.dropout': 1, 'lm_head': 0, 'encoder.block.5.layer.1': 1}
{'shared': 0, 'decoder': 0, 'encoder.embed_tokens': 0, 'encoder.block.0': 0, 'encoder.block.1': 0, 'encoder.block.2': 0, 'encoder.block.3': 0, 'encoder.block.4': 0, 'encoder.block.5.layer.0': 0, 'encoder.block.6': 1, 'encoder.block.7': 1, 'encoder.block.8': 1, 'encoder.block.9': 1, 'encoder.block.10': 1, 'encoder.block.11': 1, 'encoder.block.12': 1, 'encoder.block.13': 1, 'encoder.block.14': 1, 'encoder.block.15': 1, 'encoder.block.16': 1, 'encoder.block.17': 1, 'encoder.block.18': 1, 'encoder.block.19': 1, 'encoder.block.20': 1, 'encoder.block.21': 1, 'encoder.block.22': 1, 'encoder.block.23': 1, 'encoder.final_layer_norm': 1, 'encoder.dropout': 1, 'lm_head': 0, 'encoder.block.5.layer.1': 1}
Map: 0%| | 0/38930 [00:00<?, ? examples/s]
Map: 0%| | 0/38930 [00:00<?, ? examples/s]/users/USER/.local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:3581: UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.
warnings.warn(
/users/USER/.local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:3581: UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.
warnings.warn(
*originally tqdm bars here*
The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: ltl_formula, utterance. If ltl_formula, utterance are not expected by `T5ForConditionalGeneration.forward`, you can safely ignore this message.
/users/USER/.local/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/users/USER/.local/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
***** Running training *****
Num examples = 38930
Num Epochs = 10
Instantaneous batch size per device = 30
Total train batch size (w. parallel, distributed & accumulation) = 60
Gradient Accumulation steps = 1
Total optimization steps = 6490
Number of trainable parameters = 737668096
Traceback (most recent call last):
File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 133, in <module>
finetune_t5(args.model, tokenizer, args.data)
File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 94, in finetune_t5
trainer.train()
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1543, in train
return inner_training_loop(
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1791, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2539, in training_step
loss = self.compute_loss(model, inputs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2571, in compute_loss
outputs = model(**inputs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1626, in forward
encoder_outputs = self.encoder(
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 956, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
0%| | 0/6490 [00:00<?, ?it/s]Traceback (most recent call last):
File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 133, in <module>
finetune_t5(args.model, tokenizer, args.data)
File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 94, in finetune_t5
trainer.train()
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1543, in train
return inner_training_loop(
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1791, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2539, in training_step
loss = self.compute_loss(model, inputs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2571, in compute_loss
outputs = model(**inputs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1626, in forward
encoder_outputs = self.encoder(
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1055, in forward
layer_outputs = layer_module(
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 739, in forward
hidden_states = self.layer[-1](hidden_states)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 335, in forward
forwarded_states = self.layer_norm(hidden_states)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 261, in forward
return self.weight * hidden_states
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
0%| | 0/6490 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22547) of binary: /users/USER/anaconda/lang2ltl/bin/python
Traceback (most recent call last):
File "/users/USER/anaconda/lang2ltl/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: