How does one create a PyTorch data loader using an interleaved Hugging Face dataset?

brando · August 10, 2023, 6:22pm

When I interleave datasets, tokenize a batch, and feed it to a PyTorch DataLoader, I get errors:

# -*- coding: utf-8 -*-
"""issues with dataloader and custom data sets

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1sbs95as_66mtK9VK_vbaE9gLE-Tjof1-
"""

!pip install datasets
!pip install torch
!pip install transformers

token = None
batch_size = 10
from datasets import load_dataset
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
if tokenizer.pad_token_id is None:
  tokenizer.pad_token = tokenizer.eos_token
probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
probe_network = probe_network.to(device)

# -- Get batch from dataset
from datasets import load_dataset
# path, name = 'brando/debug1_af', 'debug1_af'
path, name = 'brando/debug0_af', 'debug0_af'
remove_columns = []
dataset = load_dataset(path, name, streaming=True, split="train", token=token).with_format("torch")
print(f'{dataset=}')
batch = dataset.take(batch_size)
# print(f'{next(iter(batch))=}')

# - Prepare functions to tokenize batch
def preprocess(examples):  # gets the raw text batch according to the specific names in table in data set & tokenize
    return tokenizer(examples["link"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
def tokenize_batch(batch):  # apply preprocess to all examples in the batch, which is itself a dataset
    return batch.map(preprocess, batched=True, remove_columns=remove_columns)
tokenized_batch = tokenize_batch(batch)
# print(f'{next(iter(tokenized_batch))=}')

from torch.utils.data import Dataset, DataLoader, SequentialSampler
dataset = tokenized_batch
print(f'{type(dataset)=}')
print(f'{dataset.__class__=}')
print(f'{isinstance(dataset, Dataset)=}')
# for i, d in enumerate(dataset):
#     assert isinstance(d, dict)
#     # dd = dataset[i]
#     # assert isinstance(dd, dict)
loader_opts = {}
classifier_opts = {}
# data_loader = DataLoader(dataset, shuffle=False, batch_size=loader_opts.get('batch_size', 1),
#                         num_workers=loader_opts.get('num_workers', 0), drop_last=False, sampler=SequentialSampler(range(512))  )
data_loader = DataLoader(dataset, shuffle=False, batch_size=loader_opts.get('batch_size', 1),
                    num_workers=loader_opts.get('num_workers', 0), drop_last=False, sampler=None)
print(f'{iter(data_loader)=}')
print(f'{next(iter(data_loader))=}')
print('Done\a')

with error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py in collate(batch, collate_fn_map)
    126         try:
--> 127             return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
    128         except TypeError:

9 frames
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'NoneType'>

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py in collate(batch, collate_fn_map)
    148                 return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]
    149 
--> 150     raise TypeError(default_collate_err_msg_format.format(elem_type))
    151 
    152 

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'NoneType'>

Why does this happen? And why do the single datasets c4 and wiki-text not give this error, only interleaved datasets?

Ideally I don’t want to write my own collate function.

brando · August 10, 2023, 6:41pm

similar error: How does one create a pytorch data loader with a custom hugging face data set without having errors? - #2 by brando

brando · August 10, 2023, 7:23pm

lhoestq · August 18, 2023, 2:49pm

Make sure the datasets you’re interleaving have the same columns, otherwise the resulting dataset might contain None for missing data.

Feel free to rename or remove columns if needed.
