Hi!
I am trying to run Pythia on a dataset, but so far without success.
I am doing the following:
# Load modules
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np
from datasets import load_dataset, Value
# Download pretrained
model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
ds = load_dataset("roneneldan/TinyStories")
small_dataset = ds['train'].select(range(1000))
BATCH_SIZE = 20
tokenizer.pad_token = tokenizer.eos_token
device = 'cuda' if torch.cuda.is_available() else 'cpu'
encoded_ds = small_dataset.map(
    lambda examples: tokenizer(examples['text'], padding=True),
    batch_size=BATCH_SIZE,
    batched=True
).remove_columns('text').with_format("pt", device=device)
But when I then try to run the model, it fails.
I do:
with torch.no_grad():
    model_output = model(**encoded_ds, output_hidden_states=True)
I don't see what is wrong with this. The message I get is rather long, but the main error is:
argument after ** must be a mapping, not Dataset
I really don’t see why this would be the case!
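
For reference, here is a minimal sketch of what the message seems to be pointing at, written without the model or the real dataset so it runs anywhere. `forward` and `ToyDataset` are hypothetical stand-ins, not part of transformers or datasets. The assumption is that a Dataset behaves like `ToyDataset` in the one respect that matters here: it can be indexed, but it is not a mapping, whereas a slice of it is a plain dict of column name to values.

```python
def forward(input_ids=None, attention_mask=None):
    """Hypothetical stand-in for model(...): takes its inputs as keyword arguments."""
    return input_ids, attention_mask


class ToyDataset:
    """Hypothetical stand-in for a dataset: indexable by slice, but it has no
    keys() method, so Python does not treat it as a mapping."""

    def __init__(self, columns):
        self._columns = columns

    def __len__(self):
        return len(next(iter(self._columns.values())))

    def __getitem__(self, rows):
        # Slicing returns a plain dict of column name -> values for those rows.
        return {name: values[rows] for name, values in self._columns.items()}


toy = ToyDataset({
    "input_ids": [[1, 2], [3, 4], [5, 6]],
    "attention_mask": [[1, 1], [1, 1], [1, 0]],
})

# Unpacking the whole object fails, because ** needs a mapping.
try:
    forward(**toy)
    error_message = None
except TypeError as err:
    error_message = str(err)
print(error_message)

# Unpacking one batch works, because the slice is a dict.
batch = toy[0:2]
ids, mask = forward(**batch)
print(ids, mask)
```

If that assumption holds, the equivalent with the code above would be to pass one batch at a time, e.g. `model(**encoded_ds[0:BATCH_SIZE], output_hidden_states=True)`, with the model moved to the same device as the tensors. That is untested here.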