Sizes of tensors must match except in dimension 0

suranap · September 25, 2023, 6:55pm

I’m trying to pre-train BERT on my dataset using Trainer and torchrun. It works on 1 GPU but fails on 2 GPUs. I’m launching it with: torchrun --nproc-per-node 2 my_benchmark.py

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 472 but got size 309 for tensor number 1 in the list.

472 is the length of an encoded sample. Why would there be a tensor of length 309 only when I run with 2 GPUs?

Here is an edited backtrace:

File "baseline_pytorch.py", line 112, in run
  trainer.train()
File "site-packages/transformers/trainer.py", line 1553, in train
  return inner_training_loop(
         ^^^^^^^^^^^^^^^^^^^^
File "site-packages/transformers/trainer.py", line 1813, in _inner_training_loop
  for step, inputs in enumerate(epoch_iterator):
File "site-packages/accelerate/data_loader.py", line 560, in __iter__
  next_batch, next_batch_info = self._fetch_batches(main_iterator)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/accelerate/data_loader.py", line 524, in _fetch_batches
  batch = concatenate(batches, dim=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/accelerate/utils/operations.py", line 496, in concatenate
  return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/accelerate/utils/operations.py", line 496, in <dictcomp>
  return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/accelerate/utils/operations.py", line 499, in concatenate
  return torch.cat(data, dim=dim)
         ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 472 but got size 309 for tensor number 1 in the list.

And some training args:

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.98,
adam_epsilon=2e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=somefile,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=10,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=somedir,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=12,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=somedir,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=7102,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=1,
weight_decay=0.0,
)

jpgard · October 24, 2023, 11:24pm

This post is probably affected by the issue below, reported on GitHub. If you are hitting the same error, you may want to ping the HF maintainers there to let them know.

github.com/huggingface/transformers

Trainer errors out when concatenating different sequence length batches with distributed training and IterableDataset

opened 06:49PM - 02 Oct 23 UTC

ssharpe42

### System Info

- `transformers` version: 4.33.3
- Platform: Linux-5.10.186-…179.751.amzn2.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.17
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.3.3
- Accelerate version: 0.23.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: A100
- Using distributed or parallel set-up in script?: torchrun --nproc-per-node 2 script.py

### Who can help?

@muellerzr, @pacman100

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

```
import torch
from torch.utils.data import IterableDataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

data = [
    {
        "input_ids": torch.tensor([101, 2040, 2001, 1999, 14936, 102]),
        "token_type_ids": torch.tensor([0, 0, 0, 0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1, 1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101, 2040, 102]),
        "token_type_ids": torch.tensor([0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101, 2040, 2001, 1999]),
        "token_type_ids": torch.tensor([0, 0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101, 2040, 2001, 1999, 14936, 102]),
        "token_type_ids": torch.tensor([0, 0, 0, 0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1, 1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101]),
        "token_type_ids": torch.tensor([00]),
        "attention_mask": torch.tensor([1]),
    },
    {
        "input_ids": torch.tensor([101]),
        "token_type_ids": torch.tensor([00]),
        "attention_mask": torch.tensor([1]),
    },
]


class ExampleDataset(IterableDataset):
    def __init__(self, data):
        super().__init__()
        self.data = data * 20

    def __iter__(self):
        for x in self.data:
            yield x

    def __len__(self):
        return len(self.data)


tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

train_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
)

dc = DataCollatorForLanguageModeling(tokenizer=tokenizer)

trainer = Trainer(
    train_dataset=ExampleDataset(data),
    model=model,
    args=train_args,
    data_collator=dc,
)

trainer.train()
```

I run the above script with the command `torchrun --nproc-per-node 2 script.py`. This results in the following error.

```
Traceback (most recent call last):
  File "fm_model/data/scratch.py", line 242, in <module>
    trainer.train()
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/transformers/trainer.py", line 1816, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/data_loader.py", line 597, in __iter__
    next_batch, next_batch_info = self._fetch_batches(main_iterator)
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/data_loader.py", line 528, in _fetch_batches
    batch = concatenate(batches, dim=0)
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 496, in concatenate
    return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 496, in <dictcomp>
    return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 499, in concatenate
    return torch.cat(data, dim=dim)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1 but got size 6 for tensor number 1 in the list.
```

This is due to the fact that in `Trainer` there are no arguments that can be passed to prepare the dataloader with [split_batches](https://github.com/huggingface/accelerate/blob/48d96319e0033fb8c8979072d97edf3995639029/src/accelerate/data_loader.py#L515) so this errors out when running this [line](https://github.com/huggingface/accelerate/blob/69e4c3c54da3201eda288b500d138761e7a5221c/src/accelerate/data_loader.py#L481). This occurs since there is no padding done across batches before these are concatenated together. In order to be able to use an iterable dataset with Trainer, something probably needs to be changed in accelerate or the Trainer to enable distributed dataloading when the batches end up being different lengths.

### Expected behavior

1. Automatic padding in accelerate when the batches produced have different lengths

OR

2. A way to specify split_batches where a full batch is produced then split for all the different processes
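
To make the missing padding step concrete, here is a minimal sketch in plain PyTorch. It is not accelerate's or Trainer's code. It assumes two batches that were each padded to their own longest sequence (the 472 and 309 from the error message at the top of this thread, with the per-device batch size of 12), shows that `torch.cat` along dimension 0 rejects them, and then right-pads both to a common length first. The helper `pad_to_length` and the pad value 0 are hypothetical choices for illustration; for real inputs you would likely use the tokenizer's pad token id for `input_ids` and the collator's ignore index for `labels`.

```python
import torch
import torch.nn.functional as F


def pad_to_length(batch: torch.Tensor, length: int, pad_value: int = 0) -> torch.Tensor:
    """Right-pad a (batch_size, seq_len) tensor along dim 1 up to `length`.

    Hypothetical helper for illustration; not part of accelerate or transformers.
    """
    missing = length - batch.shape[1]
    if missing < 0:
        raise ValueError("batch is already longer than the target length")
    # The pad tuple (0, missing) means: pad the last dimension by 0 on the left
    # and `missing` on the right.
    return F.pad(batch, (0, missing), value=pad_value)


# Two batches, each padded only to its own longest sequence.
batch_a = torch.ones(12, 472, dtype=torch.long)
batch_b = torch.ones(12, 309, dtype=torch.long)

# Concatenating along dim 0 requires every other dimension to match.
try:
    torch.cat([batch_a, batch_b], dim=0)
    error_message = None
except RuntimeError as err:
    error_message = str(err)

# Padding both batches to a shared length makes the concatenation valid.
target_len = max(batch_a.shape[1], batch_b.shape[1])
merged = torch.cat(
    [pad_to_length(b, target_len) for b in (batch_a, batch_b)], dim=0
)
print(error_message)
print(merged.shape)
```

If this is what is happening in your run, one workaround to try is padding every sample to a fixed length before batching (for example at tokenization time), so that all batches already share the same sequence length when they are concatenated. That trades some wasted compute on padding tokens for shape consistency across processes.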
