I'm having trouble loading my data in the right shape/format for my model, and I'm not finding much clarity in the docs (especially around streaming datasets). What I'd like to do is preprocess some text and return a 32xN batch for fine-tuning GPT. My setup is roughly the following:
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
...
# Set up the DataLoader for the training data
train_dataset = load_dataset(
    "huggingface-course/codeparrot-ds-train", streaming=True, split="train"
)
train_dataset = train_dataset.map(
    lambda x: tokenizer(x["content"], truncation=True, padding="max_length"),
    batched=True,
    remove_columns=["content", "repo_name", "path", "copies", "size", "license"],
)
train_dataset = train_dataset.with_format("torch")
train_loader = DataLoader(train_dataset, batch_size=32)
However, once the data arrives in my model, the shape is 1024x32: input_ids comes through as a list of 1024 tensors, each of size 32, rather than a single 32x1024 tensor that I can process.
def training_step(self, batch, batch_idx):
    # Get the inputs from the batch.
    input_ids = batch["input_ids"]  # a list of 1024 tensors of shape (batch_size,) ???
    attention_mask = batch["attention_mask"]
    # Compute the logits.
    logits = self(input_ids, attention_mask)
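Here's a minimal repro of the collation behavior I think I'm hitting, outside my actual pipeline. My assumption is that the streamed examples are still reaching the DataLoader as plain Python lists, and PyTorch's default collate transposes a batch of lists into per-position tensors:

```python
import torch
from torch.utils.data import default_collate

# Two toy "examples" whose input_ids are plain Python lists (seq_len = 3).
batch = [
    {"input_ids": [1, 2, 3]},
    {"input_ids": [4, 5, 6]},
]

collated = default_collate(batch)
# Lists are collated per position: we get seq_len tensors of shape
# (batch_size,) instead of one (batch_size, seq_len) tensor.
print(type(collated["input_ids"]))  # <class 'list'>
print(len(collated["input_ids"]))   # 3, i.e. seq_len
print(collated["input_ids"][0])     # tensor([1, 4])
```

With seq_len = 1024 and batch_size = 32, that would produce exactly the 1024 tensors of size 32 I'm seeing.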
Am I misusing the load_dataset and DataLoader APIs? Or what is the best way to resolve this?
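For now, the only thing that works is re-stacking the lists manually at the top of training_step, which feels like I'm fighting the API. (fix_batch here is just a helper I wrote for this, not anything from the libraries.)

```python
import torch

def fix_batch(batch):
    """Re-stack a collated batch whose values arrived as lists of
    per-position tensors (seq_len entries of shape (batch_size,))
    into proper (batch_size, seq_len) tensors."""
    return {
        key: torch.stack(value, dim=1) if isinstance(value, list) else value
        for key, value in batch.items()
    }
```

With the 1024x32 lists described above, fix_batch(batch)["input_ids"].shape comes out as (32, 1024), but I'd much rather fix the pipeline so this isn't needed.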