HuggingFace dataset: each element in list of batch should be of equal size

RylanSchaeffer · October 11, 2021, 10:42pm

I’m trying to use HuggingFace’s tokenizers and datasets with a PyTorch dataloader, like so:

dataset = load_dataset(
    'wikitext',
    'wikitext-2-raw-v1',
    split='train[:5%]',  # take only first 5% of the dataset
    cache_dir=cache_dir)

tokenized_dataset = dataset.map(
    lambda e: self.tokenizer(e['text'],
                             padding=True,
                             max_length=512,
                             # padding='max_length',
                             truncation=True),
    batched=True)

with a dataloader:

dataloader = torch.utils.data.DataLoader(
    dataset=tokenized_dataset,
    batch_size=batch_size,
    shuffle=True)

But the dataloader throws the following error:

  File "/home/rschaef/CoCoSci-Language-Distillation/distillation_v2/ratchet_learning/train.py", line 139, in run_epoch
    for batch_idx, batch in enumerate(task.dataloader):
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 81, in default_collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size

Why is this happening and how do I prevent it from happening?

adorkin · October 12, 2021, 8:45am

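With these tokenizer call parameters (i.e. padding=True, which is equivalent to padding='longest'), your inputs are padded to the longest sequence within the batch passed to the tokenizer. With batched=True, map() hands the tokenizer one chunk of examples at a time (1000 by default), so the padded length differs from chunk to chunk. A shuffled DataLoader batch can therefore contain examples of different lengths, and that's what the DataLoader's default collate function complains about.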
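To make the mechanism concrete, here is a minimal, library-free sketch. It only imitates the padding behaviour with plain Python lists; `PAD_ID`, `pad_to_longest` and `map_batched` are hypothetical names made up for the illustration, not part of datasets or transformers.

```python
PAD_ID = 0  # hypothetical padding token id, for illustration only


def pad_to_longest(chunk):
    """Pad every sequence in `chunk` to the length of its longest member."""
    longest = max(len(seq) for seq in chunk)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in chunk]


def map_batched(sequences, map_batch_size):
    """Imitate map(..., batched=True): each chunk is padded independently."""
    padded = []
    for start in range(0, len(sequences), map_batch_size):
        padded.extend(pad_to_longest(sequences[start:start + map_batch_size]))
    return padded


# Toy "tokenized" texts of lengths 3, 5, 2 and 9.
sequences = [[1] * n for n in (3, 5, 2, 9)]
padded = map_batched(sequences, map_batch_size=2)
lengths = [len(seq) for seq in padded]
print(lengths)  # [5, 5, 9, 9]

# A shuffled DataLoader batch may draw one example from each chunk;
# rows of length 5 and 9 cannot be collated into one rectangular batch.
mixed_batch = [padded[0], padded[2]]
same_length = len({len(seq) for seq in mixed_batch}) == 1
print(same_length)  # False
```

Each chunk is internally consistent, but the dataset as a whole is not, which is why the error only shows up once the DataLoader mixes rows.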

The easiest way to solve this is to set padding to 'max_length', so that every example is padded to the same length (512 in your case). Alternatively, if you want to pad to the batch's longest sequence after all, you'll need to move padding and truncation to a custom collate function and pass it to the DataLoader.

adorkin · October 12, 2021, 9:22am

An example collate function may look like this in your case:

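def collate_tokenize(data):
    # `data` is a list of raw dataset rows; tokenizing them together means
    # padding='longest' refers to this DataLoader batch only.
    text_batch = [element["text"] for element in data]
    tokenized = tokenizer(text_batch, padding='longest', truncation=True, return_tensors='pt')
    return tokenized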

Then you pass it to the DataLoader, together with the raw (untokenized) dataset:

dataloader = torch.utils.data.DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_tokenize)

Also, here's a somewhat outdated article that has an example of a collate function.
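If you would rather keep tokenizing once with map() (with truncation but without padding), the collate function only has to pad. Below is a rough, library-free sketch of that variant using plain lists; `PAD_ID` and `collate_pad` are hypothetical names, and a real tokenizer defines its own padding token id.

```python
PAD_ID = 0  # hypothetical padding token id; a real tokenizer exposes its own


def collate_pad(batch):
    """Pad pre-tokenized examples to the longest sequence in this batch.

    `batch` is a list of dicts, each holding an unpadded "input_ids" list.
    """
    longest = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask = [], []
    for example in batch:
        ids = example["input_ids"]
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [{"input_ids": [101, 7, 102]}, {"input_ids": [101, 7, 8, 9, 102]}]
out = collate_pad(batch)
print(out["input_ids"])       # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(out["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

In an actual training loop you would wrap the two lists in torch.tensor(...) before returning them. The transformers library also ships DataCollatorWithPadding, which does this kind of per-batch padding using the tokenizer's own pad token.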

brando · August 10, 2023, 9:52pm

@adorkin Curious: do you know if this would work if you passed the collate function to the Trainer object?
