My goal is to use a dataset I tokenized as input to a model, have it generate the output text in batches, and then decode that output into strings. I have written some code, but it throws an AttributeError when I pass the dataset as input, and I can’t work out how to do this correctly. I saw that you can pass a tokenized list as input instead. Is there no way to do this with datasets directly, or am I missing something?
Here is what I do:
I have a dataset that was initialized from a dataframe and has the following outline:
Dataset({
    features: ['question', 'answer'],
    num_rows: 500
})
where question and answer are both strings.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import Dataset
checkpoint = "bigscience/mt0-xxl"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenized_ds = tokenize(tokenizer, dataset)
outputs = model.generate(tokenized_ds)
The tokenize function is defined as follows:
def tokenize(tok, ds):
    def tokenize_fn(sample):
        result = tok(sample['question'], padding=True, return_tensors='pt')
        return result
    tokenized = ds.map(
        tokenize_fn, batched=True, remove_columns=['question', 'answer']
    )
    return tokenized
I get the following error:
AttributeError: 'Dataset' object has no attribute 'dtype'
How can I use my dataset as input for my model?
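For comparison, here is a minimal sketch of the list-based route I mentioned, where plain lists of question strings are tokenized, generated, and decoded one batch at a time. The helper names `iter_batches` and `generate_answers` and the `batch_size` parameter are my own placeholders, not part of transformers or datasets. The sketch assumes the model and the tokenized tensors live on the same device and that the tokenizer returns only keys that `generate` accepts (such as `input_ids` and `attention_mask`).

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_answers(questions, tokenizer, model, batch_size=8, **generate_kwargs):
    """Tokenize, generate, and decode a list of question strings batch by batch.

    `tokenizer` and `model` are assumed to behave like the objects returned by
    AutoTokenizer.from_pretrained and AutoModelForSeq2SeqLM.from_pretrained.
    """
    answers = []
    for batch in iter_batches(questions, batch_size):
        # Pad within the batch only, so every tensor in `encoded` is rectangular.
        encoded = tokenizer(batch, padding=True, return_tensors="pt")
        # generate() receives the tensors as keyword arguments
        # (input_ids, attention_mask), not a Dataset object.
        output_ids = model.generate(**encoded, **generate_kwargs)
        # Turn the generated token ids back into plain strings.
        answers.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return answers
```

With the objects from my code above, the call would be something like `generate_answers(list(dataset['question']), tokenizer, model, max_new_tokens=32)`, where the `max_new_tokens` value is only an illustrative setting.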