### System Info
- `transformers` version: 4.33.3
- Platform: Linux-5.10.186-…179.751.amzn2.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.17
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.3.3
- Accelerate version: 0.23.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: A100
- Using distributed or parallel set-up in script?: torchrun --nproc-per-node 2 script.py
### Who can help?
@muellerzr, @pacman100
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Reproduction
```
import torch
from torch.utils.data import IterableDataset

from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

data = [
    {
        "input_ids": torch.tensor([101, 2040, 2001, 1999, 14936, 102]),
        "token_type_ids": torch.tensor([0, 0, 0, 0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1, 1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101, 2040, 102]),
        "token_type_ids": torch.tensor([0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101, 2040, 2001, 1999]),
        "token_type_ids": torch.tensor([0, 0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101, 2040, 2001, 1999, 14936, 102]),
        "token_type_ids": torch.tensor([0, 0, 0, 0, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 1, 1, 1, 1]),
    },
    {
        "input_ids": torch.tensor([101]),
        "token_type_ids": torch.tensor([0]),
        "attention_mask": torch.tensor([1]),
    },
    {
        "input_ids": torch.tensor([101]),
        "token_type_ids": torch.tensor([0]),
        "attention_mask": torch.tensor([1]),
    },
]


class ExampleDataset(IterableDataset):
    def __init__(self, data):
        super().__init__()
        self.data = data * 20

    def __iter__(self):
        for x in self.data:
            yield x

    def __len__(self):
        return len(self.data)


tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

train_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
)
dc = DataCollatorForLanguageModeling(tokenizer=tokenizer)
trainer = Trainer(
    train_dataset=ExampleDataset(data),
    model=model,
    args=train_args,
    data_collator=dc,
)
trainer.train()
```
I run the above script with `torchrun --nproc-per-node 2 script.py`, which results in the following error:
```
Traceback (most recent call last):
File "fm_model/data/scratch.py", line 242, in <module>
trainer.train()
File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/transformers/trainer.py", line 1556, in train
return inner_training_loop(
File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/transformers/trainer.py", line 1816, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/data_loader.py", line 597, in __iter__
next_batch, next_batch_info = self._fetch_batches(main_iterator)
File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/data_loader.py", line 528, in _fetch_batches
batch = concatenate(batches, dim=0)
File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 496, in concatenate
return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 496, in <dictcomp>
return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 499, in concatenate
return torch.cat(data, dim=dim)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1 but got size 6 for tensor number 1 in the list.
```
This happens because `Trainer` exposes no argument for preparing the dataloader with [split_batches](https://github.com/huggingface/accelerate/blob/48d96319e0033fb8c8979072d97edf3995639029/src/accelerate/data_loader.py#L515), so the run fails at this [line](https://github.com/huggingface/accelerate/blob/69e4c3c54da3201eda288b500d138761e7a5221c/src/accelerate/data_loader.py#L481): the per-process batches are not padded to a common length before they are concatenated.
To use an iterable dataset with `Trainer` in a distributed setup, something likely needs to change in `accelerate` or in `Trainer` so that distributed dataloading works when the fetched batches have different sequence lengths. A possible user-side workaround is sketched below.
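As a stopgap (not a fix in `Trainer` or `accelerate`), one can pad every batch up to a fixed sequence length so that all per-process batches concatenate cleanly. A minimal sketch, assuming a hypothetical `FixedLengthCollator` wrapper and an arbitrary `max_length=128`:
```
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling


class FixedLengthCollator:
    """Hypothetical workaround: wrap a collator and right-pad each batch to a fixed length."""

    def __init__(self, base_collator, pad_token_id, max_length=128):
        self.base_collator = base_collator
        self.pad_token_id = pad_token_id
        self.max_length = max_length

    def __call__(self, features):
        # The base collator only pads to the longest example in *this* batch.
        batch = self.base_collator(features)
        pad_len = self.max_length - batch["input_ids"].shape[1]
        if pad_len > 0:
            def pad(t, value):
                return torch.nn.functional.pad(t, (0, pad_len), value=value)

            batch["input_ids"] = pad(batch["input_ids"], self.pad_token_id)
            batch["attention_mask"] = pad(batch["attention_mask"], 0)
            if "token_type_ids" in batch:
                batch["token_type_ids"] = pad(batch["token_type_ids"], 0)
            if "labels" in batch:
                batch["labels"] = pad(batch["labels"], -100)  # ignore index for the MLM loss
        return batch


tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dc = FixedLengthCollator(
    DataCollatorForLanguageModeling(tokenizer=tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    max_length=128,
)
```
Passing this wrapper as `data_collator` in the reproduction above avoids the shape mismatch, at the cost of always padding to `max_length`.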
### Expected behavior
1. Automatic padding in `accelerate` when the fetched batches have different sequence lengths
OR
2. A way to specify `split_batches` so that one full batch is produced and then split across the processes (see the sketch below)
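For illustration, a minimal sketch of what option 2 looks like outside of `Trainer`, assuming `ExampleDataset`, `data`, and the collator `dc` from the reproduction above (the global batch size of 4 is an arbitrary choice):
```
from accelerate import Accelerator
from torch.utils.data import DataLoader

# With split_batches=True, the dispatching process builds one full batch and
# slices it across processes, so every shard has the same sequence length and
# no cross-batch concatenation is needed.
accelerator = Accelerator(split_batches=True)
loader = accelerator.prepare(
    DataLoader(ExampleDataset(data), batch_size=4, collate_fn=dc)
)

for batch in loader:
    pass  # each process receives a slice of the same padded batch
```
Exposing something equivalent through `TrainingArguments` (or padding across batches in `accelerate` itself, as in option 1) would make iterable datasets with variable-length batches usable with `Trainer` in distributed runs.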