HuggingFace dataset: each element in list of batch should be of equal size

I’m trying to use HuggingFace’s tokenizers and datasets with a PyTorch dataloader, like so:

            dataset = load_dataset(
                'wikitext',
                'wikitext-2-raw-v1',
                split='train[:5%]',  # take only first 5% of the dataset
                cache_dir=cache_dir)

            tokenized_dataset = dataset.map(
                lambda e: self.tokenizer(e['text'],
                                         padding=True,
                                         max_length=512,
                                         # padding='max_length',
                                         truncation=True),
                batched=True)

with a dataloader:

        dataloader = torch.utils.data.DataLoader(
            dataset=tokenized_dataset,
            batch_size=batch_size,
            shuffle=True)

But the dataloader throws the following error:

  File "/home/rschaef/CoCoSci-Language-Distillation/distillation_v2/ratchet_learning/train.py", line 139, in run_epoch
    for batch_idx, batch in enumerate(task.dataloader):
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/home/rschaef/CoCoSci-Language-Distillation/cocosci/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 81, in default_collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size

Why is this happening and how do I prevent it from happening?

With these tokenizer call parameters (i.e. padding=True which is equivalent to padding='longest') your inputs are padded to the longest sequence within the batch passed to it. Since the longest sequence differs from batch to batch, lengths of tokenized batches also differ and that’s what the Data Loader complains about.

The easiest way to solve this is to set padding to 'max_length'. Alternatively, If you want to pad to the batch’s longest sequence after all, you’ll need to move padding and truncation to a custom collate function that you need to pass to the Data Loader.

An example collate function may look like this in your case:

def collate_tokenize(data):
  text_batch = [element["text"] for element in data]
  tokenized = tokenizer(text_batch, padding='longest', truncation=True, return_tensors='pt')
  return tokenized

Then you pass to the dataloader:

dataloader = torch.utils.data.DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_tokenize
    )

Also, here’s a somewhat outdated article that has an example of collate function.