Issues formatting Dataset to PyTorch Pythia LM

mascaretti · July 18, 2024, 2:02pm

Hi!
I am trying to run a dataset through Pythia, but I can't get it to work.

I am doing the following:

# Load modules
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np
from datasets import load_dataset, Value

# Download pretrained
model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ds = load_dataset("roneneldan/TinyStories")

small_dataset = ds['train'].select(range(1000))

BATCH_SIZE = 20

tokenizer.pad_token = tokenizer.eos_token
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)  # keep the model on the same device as the tensors below

encoded_ds = small_dataset.map(
    lambda examples: tokenizer(examples['text'], padding=True),
    batch_size=BATCH_SIZE,
    batched=True
).remove_columns('text').with_format("pt", device=device)

But when I try to run the model on it, the call fails.

I do:

with torch.no_grad():
    model_output = model(**encoded_ds, output_hidden_states=True)

I don’t see what is wrong with this. I get a rather long traceback, but the error boils down to:
argument after ** must be a mapping, not Dataset

I really don’t see why this would be the case!
