How can I use tokenized Dataset for Text Generation?

laurinpaech · January 22, 2023, 2:31pm

My goal is to use a dataset I tokenized as input to a model, have the model generate the output text in batches, and then decode it to get the outputs as strings. I have written some code, but it throws an AttributeError when I provide the dataset as input, and I cannot figure out how to do this correctly. I saw that you can provide a tokenized list as input instead. Is there no way to do this with datasets directly, or am I missing something?

Here is what I do:

I have a dataset that was initialized from a dataframe and has the following outline:

Dataset({
    features: ['question', 'answer'],
    num_rows: 500
})

where question and answer are both strings.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import Dataset

checkpoint = "bigscience/mt0-xxl"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

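# `dataset` is the Dataset shown above; `tokenize` is defined below.
tokenized_ds = tokenize(tokenizer, dataset)
outputs = model.generate(tokenized_ds)  # this call raises the AttributeError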

The tokenize function is defined as follows:

def tokenize(tok, ds):
    def tokenize_fn(sample):
        result = tok(sample['question'], padding=True, return_tensors='pt')
        return result

    tokenized = ds.map(
        tokenize_fn, batched=True, remove_columns=['question', 'answer']
    )

    return tokenized

I get the following error:
AttributeError: 'Dataset' object has no attribute 'dtype'

How can I use my dataset as input for my model?
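
One possible direction, as a minimal sketch rather than a confirmed answer: the error appears to come from generate() expecting tensors such as input_ids rather than a Dataset object. Under that assumption, the raw question strings can be tokenized one batch at a time, passed to generate() as keyword arguments, and decoded with batch_decode(). The helper name generate_in_batches and its batch_size and generate_kwargs parameters are hypothetical and chosen only for illustration.

```python
def generate_in_batches(model, tokenizer, questions, batch_size=8, **generate_kwargs):
    """Tokenize, generate, and decode a list of strings batch by batch.

    Hypothetical helper: `model` is assumed to expose `generate(...)` and
    `tokenizer` is assumed to be callable and to expose `batch_decode(...)`,
    as Hugging Face seq2seq models and tokenizers do.
    """
    answers = []
    for start in range(0, len(questions), batch_size):
        batch = questions[start:start + batch_size]
        # Pad within the batch so every row has the same length.
        inputs = tokenizer(batch, padding=True, return_tensors="pt")
        # Unpack input_ids / attention_mask as keyword arguments
        # instead of handing generate() a Dataset object.
        output_ids = model.generate(**inputs, **generate_kwargs)
        # Turn the generated token ids back into plain strings.
        answers.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return answers
```

With the objects from the post, this could be called as generate_in_batches(model, tokenizer, list(dataset["question"]), batch_size=8, max_new_tokens=64), where the batch size and max_new_tokens values are arbitrary examples. Tokenizing per batch keeps padding local to each batch, so the Dataset.map step is not needed for generation in this sketch.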
