I want to do text generation, and I'm trying to use the GPT-2 model and tokenizer for that purpose.
I am at the stage of adding dynamic padding to the dataset using a DataLoader, but I can't iterate through the DataLoader object after the dynamic padding is added (I'm assuming that's the cause), and it gives this error:
Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
when I try to iterate over it with this code:
for step, batch in enumerate(train_dataloader):
    print(batch)
    if step > 5:
        break
Full code:
import pandas as pd
from datasets import load_dataset
from transformers import GPT2TokenizerFast
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
import torch
raw_datasets = load_dataset("csv", data_files="dataset.csv", sep=";")
raw_datasets
raw_train_datasets = raw_datasets["train"]
raw_train_datasets
checkpoint = "gpt2"
tokenizer = GPT2TokenizerFast.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)
tokenized_datasets = raw_train_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence"])
tokenized_datasets = tokenized_datasets.with_format("torch")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(tokenized_datasets["input_ids"], batch_size=16, shuffle=True, collate_fn=data_collator)
train_dataloader
for step, batch in enumerate(train_dataloader):
    print(batch)
    if step > 5:
        break
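For context, here is my mental model of what the dynamic padding step should produce. The token ids are just illustrative values, not from my actual dataset; the only real value is 50256, which is gpt2's eos token id (and I set pad_token = eos_token above):

```python
# Plain-Python sketch of dynamic padding as I understand it: pad every
# sequence in a batch up to the length of the longest one in that batch,
# using the pad token id (gpt2's eos id, 50256).
pad_id = 50256  # gpt2 eos/pad token id
batch = [
    [31373, 995],           # a short tokenized sentence (illustrative ids)
    [31373, 995, 11, 995],  # a longer one
]
max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
print(padded)
# [[31373, 995, 50256, 50256], [31373, 995, 11, 995]]
```

This is what I assumed DataCollatorWithPadding would do per batch, so I don't understand why it still complains about tensors of different lengths.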
Please help; I've been trying to debug this but can't get anywhere…