Finetuning T5-large on Multiple GPUs

Hi,
I am trying to finetune a T5-large model on multiple GPUs on a cluster, and I got the following error message:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

I am able to finetune T5-base on the same cluster.

I’d like to ask two questions:

  1. Is it expected behavior for tensors to end up on different devices? Some tutorials (1, 2) suggest that lm_head has to be on the same device as the embedding layer, and I’ve made sure that mapping is correct. Is there anything else I’m missing?
  2. infer_auto_device_map() tries to load the entire model onto one GPU whenever I set max_memory to anything above 1GiB per device, even though T5-large is definitely much larger than 1GiB. Am I using infer_auto_device_map() correctly? (A minimal sketch of what I mean is right after this list.)
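
To make the two questions concrete, here is a minimal sketch of the kind of call I mean (not my exact script; the 10GiB cap is just an illustrative value):

import torch
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_pretrained("t5-large")
with init_empty_weights():
    meta_model = T5ForConditionalGeneration(config)

# Hypothetical per-GPU cap, just to illustrate the max_memory argument.
device_map = infer_auto_device_map(
    meta_model,
    max_memory={0: "10GiB", 1: "10GiB"},
    dtype=torch.float16,
)

# Force lm_head onto the same device as the tied input embedding ("shared").
# .get() guards against the map collapsing into a single top-level entry.
device_map["lm_head"] = device_map.get("shared", 0)
print({k: v for k, v in device_map.items() if k in ("shared", "lm_head")})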

Any feedback or suggestion will be greatly appreciated!


OS Version of Cluster:

NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"


Environment Package Versions:

python==3.9.16
torch==1.13.1+cu117
cudatoolkit==11.3.1
cuda version==11.7
torchvision==0.14.1
accelerate==0.17.1


Command:

accelerate launch s2s_hf_transformers.py --model=t5-large


Here are the main function and the TrainingArguments:

import argparse

import torch
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration

# main function
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, default='data/data.pkl')
    parser.add_argument('--model', type=str, default='t5-large')
    args = parser.parse_args()
    max_memory = {i: "1GiB" for i in range(2)}  # 2 GPUs
    config = T5Config.from_pretrained(args.model)
    tokenizer = T5Tokenizer.from_pretrained(args.model)
    # Build the device map on an empty (meta) model so no weights are materialized.
    with init_empty_weights():
        model = T5ForConditionalGeneration(config)
        device_map = infer_auto_device_map(
            model,
            no_split_module_classes=["T5DecoderLayer", "T5EncoderLayer"],
            dtype=torch.float16,
            max_memory=max_memory,
        )
    device_map['lm_head'] = 0  # keep lm_head on the same device as the embedding layer
    print(device_map)
    model = T5ForConditionalGeneration.from_pretrained(args.model, device_map=device_map)
    # T5_MODELS and finetune_t5 are defined earlier in s2s_hf_transformers.py
    if args.model in T5_MODELS:
        finetune_t5(args.model, tokenizer, args.data)
    else:
        raise TypeError(f"ERROR: unrecognized model, {args.model}")
# TrainingArguments
train_args = Seq2SeqTrainingArguments(
        output_dir=f"model/{model_name}",
        evaluation_strategy="steps",
        eval_steps=100,
        logging_strategy="steps",
        logging_steps=100,
        save_strategy="steps",
        save_steps=100,
        learning_rate=1e-4,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        weight_decay=0.01,
        num_train_epochs=10,
        # fp16=True,
        predict_with_generate=True,
        metric_for_best_model="exact_match",
        load_best_model_at_end=True,
        save_total_limit=3,
        overwrite_output_dir=True,
        report_to="tensorboard",
    )
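
Inside finetune_t5, the trainer is set up roughly like this (a sketch for context rather than the exact code; the dataset, collator, and metric variable names are placeholders):

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,      # tokenized train split (placeholder)
    eval_dataset=eval_dataset,        # tokenized validation split (placeholder)
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # exact-match metric fn (placeholder)
)
trainer.train()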


Complete error message (tqdm bars are removed):

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `2`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
{'shared': 0, 'decoder': 0, 'encoder.embed_tokens': 0, 'encoder.block.0': 0, 'encoder.block.1': 0, 'encoder.block.2': 0, 'encoder.block.3': 0, 'encoder.block.4': 0, 'encoder.block.5.layer.0': 0, 'encoder.block.6': 1, 'encoder.block.7': 1, 'encoder.block.8': 1, 'encoder.block.9': 1, 'encoder.block.10': 1, 'encoder.block.11': 1, 'encoder.block.12': 1, 'encoder.block.13': 1, 'encoder.block.14': 1, 'encoder.block.15': 1, 'encoder.block.16': 1, 'encoder.block.17': 1, 'encoder.block.18': 1, 'encoder.block.19': 1, 'encoder.block.20': 1, 'encoder.block.21': 1, 'encoder.block.22': 1, 'encoder.block.23': 1, 'encoder.final_layer_norm': 1, 'encoder.dropout': 1, 'lm_head': 0, 'encoder.block.5.layer.1': 1}
{'shared': 0, 'decoder': 0, 'encoder.embed_tokens': 0, 'encoder.block.0': 0, 'encoder.block.1': 0, 'encoder.block.2': 0, 'encoder.block.3': 0, 'encoder.block.4': 0, 'encoder.block.5.layer.0': 0, 'encoder.block.6': 1, 'encoder.block.7': 1, 'encoder.block.8': 1, 'encoder.block.9': 1, 'encoder.block.10': 1, 'encoder.block.11': 1, 'encoder.block.12': 1, 'encoder.block.13': 1, 'encoder.block.14': 1, 'encoder.block.15': 1, 'encoder.block.16': 1, 'encoder.block.17': 1, 'encoder.block.18': 1, 'encoder.block.19': 1, 'encoder.block.20': 1, 'encoder.block.21': 1, 'encoder.block.22': 1, 'encoder.block.23': 1, 'encoder.final_layer_norm': 1, 'encoder.dropout': 1, 'lm_head': 0, 'encoder.block.5.layer.1': 1}
Map:   0%|          | 0/38930 [00:00<?, ? examples/s]
Map:   0%|          | 0/38930 [00:00<?, ? examples/s]/users/USER/.local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:3581: UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.
  warnings.warn(
/users/USER/.local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:3581: UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.
  warnings.warn(
*originally tqdm bars here*
The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: ltl_formula, utterance. If ltl_formula, utterance are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
/users/USER/.local/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/users/USER/.local/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 38930
  Num Epochs = 10
  Instantaneous batch size per device = 30
  Total train batch size (w. parallel, distributed & accumulation) = 60
  Gradient Accumulation steps = 1
  Total optimization steps = 6490
  Number of trainable parameters = 737668096
Traceback (most recent call last):
  File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 133, in <module>
    finetune_t5(args.model, tokenizer, args.data)
  File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 94, in finetune_t5
    trainer.train()
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1791, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2539, in training_step
    loss = self.compute_loss(model, inputs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2571, in compute_loss
    outputs = model(**inputs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1626, in forward
    encoder_outputs = self.encoder(
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 956, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
    return F.embedding(
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
  0%|          | 0/6490 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 133, in <module>
    finetune_t5(args.model, tokenizer, args.data)
  File "/gpfs/data/stellex/USER/git/Lang2LTL/s2s_hf_transformers.py", line 94, in finetune_t5
    trainer.train()
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1791, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2539, in training_step
    loss = self.compute_loss(model, inputs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2571, in compute_loss
    outputs = model(**inputs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1626, in forward
    encoder_outputs = self.encoder(
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1055, in forward
    layer_outputs = layer_module(
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 739, in forward
    hidden_states = self.layer[-1](hidden_states)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 335, in forward
    forwarded_states = self.layer_norm(hidden_states)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/users/USER/.local/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 261, in forward
    return self.weight * hidden_states
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
  0%|          | 0/6490 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22547) of binary: /users/USER/anaconda/lang2ltl/bin/python
Traceback (most recent call last):
  File "/users/USER/anaconda/lang2ltl/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/users/USER/anaconda/lang2ltl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: