I'm having trouble loading my data in the right shape/format for my model, and I'm not finding much clarity in the docs (especially around streaming datasets). What I'd like to do is preprocess some text and return a 32xN batch for fine-tuning GPT. My setup is roughly the following:
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
...
# Set up the DataLoader for the training data
train_dataset = load_dataset(
    "huggingface-course/codeparrot-ds-train", streaming=True, split="train"
)
train_dataset = train_dataset.map(
    lambda x: tokenizer(x["content"], truncation=True, padding="max_length"),
    batched=True,
    remove_columns=["content", "repo_name", "path", "copies", "size", "license"],
)
train_dataset = train_dataset.with_format("torch")
train_loader = DataLoader(train_dataset, batch_size=32)
However, once the data arrives in my model, the shape is 1024x32: input_ids comes through as a list of 1024 tensors, each of size 32, rather than a single 32x1024 tensor that I can process.
def training_step(self, batch, batch_idx):
    # Get the inputs from the batch.
    input_ids = batch["input_ids"]  # a list of 1024 tensors of shape (batch_size,) ???
    attention_mask = batch["attention_mask"]
    # Compute the logits.
    logits = self(input_ids, attention_mask)
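Here's a minimal repro of the collation behavior I think I'm hitting, outside my actual pipeline. My assumption is that the streamed examples are still reaching the DataLoader as plain Python lists, and PyTorch's default collate transposes a batch of lists into per-position tensors:

```python
import torch
from torch.utils.data import default_collate

# Two toy "examples" whose input_ids are plain Python lists (seq_len = 3).
batch = [
    {"input_ids": [1, 2, 3]},
    {"input_ids": [4, 5, 6]},
]

collated = default_collate(batch)
# Lists are collated per position: we get seq_len tensors of shape
# (batch_size,) instead of one (batch_size, seq_len) tensor.
print(type(collated["input_ids"]))  # <class 'list'>
print(len(collated["input_ids"]))   # 3, i.e. seq_len
print(collated["input_ids"][0])     # tensor([1, 4])
```

With seq_len = 1024 and batch_size = 32, that would produce exactly the 1024 tensors of size 32 I'm seeing.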
Am I misusing the load_dataset and DataLoader APIs? Or what is the best way to resolve this?
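For now, the only thing that works is re-stacking the lists manually at the top of training_step, which feels like I'm fighting the API. (fix_batch here is just a helper I wrote for this, not anything from the libraries.)

```python
import torch

def fix_batch(batch):
    """Re-stack a collated batch whose values arrived as lists of
    per-position tensors (seq_len entries of shape (batch_size,))
    into proper (batch_size, seq_len) tensors."""
    return {
        key: torch.stack(value, dim=1) if isinstance(value, list) else value
        for key, value in batch.items()
    }
```

With the 1024x32 lists described above, fix_batch(batch)["input_ids"].shape comes out as (32, 1024), but I'd much rather fix the pipeline so this isn't needed.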