Quick Tour: "Train using TensorFlow" gives `Dataset argument should be a datasets.Dataset` error

The end of the :hugs: Hugging Face Quick Tour guide has a section on how to Train with TensorFlow. The code amounts to:

# Load a data set
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")

# Load a classifier
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Load the corresponding tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Apply the tokenizer over the data set
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])

dataset = dataset.map(tokenize_dataset)

# Prepare the data set for use
tf_dataset = model.prepare_tf_dataset(
    dataset, batch_size=16, shuffle=True, tokenizer=tokenizer
)

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(3e-5))
model.fit(tf_dataset)

The call that prepares the dataset:

tf_dataset = model.prepare_tf_dataset(dataset, ...

throws the error:

TypeError: Dataset argument should be a datasets.Dataset!

Research Effort

The only mention of this exception anywhere on the Internet is in the :hugs: Hugging Face GitHub source:

if not isinstance(dataset, datasets.Dataset):
    raise TypeError("Dataset argument should be a datasets.Dataset!")
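
For what it's worth, inspecting the object returned by load_dataset suggests it is a DatasetDict of splits rather than a single Dataset (the output in the comments below is roughly what I'd expect; exact formatting may differ):

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")

# The top-level object is a dictionary of splits...
print(type(dataset))           # <class 'datasets.dataset_dict.DatasetDict'>
print(dataset.keys())          # dict_keys(['train', 'validation', 'test'])

# ...and each individual split is the datasets.Dataset the check expects
print(type(dataset["train"]))  # <class 'datasets.arrow_dataset.Dataset'>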

So what am I doing wrong? Or, alternatively, what do I have to do differently?

Bonus Chatter

I tried looking at the TFAutoModelForSequenceClassification.from_pretrained documentation, which says it will return a TFDistilBertForSequenceClassification instance.

I wanted to figure out what type the method expects, but the documentation for TFDistilBertForSequenceClassification doesn't document those methods (e.g. no compile, fit, or prepare_tf_dataset).
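
The closest I could get was inspecting the method at runtime, which at least shows the parameter names and the docstring (plain Python introspection, nothing Transformers-specific):

import inspect

# Show the call signature of prepare_tf_dataset on the loaded model
print(inspect.signature(model.prepare_tf_dataset))

# Show its docstring, which describes what the dataset argument should be
print(inspect.getdoc(model.prepare_tf_dataset))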


I got the same issue. Did you figure out a solution?

Hi! To avoid this error, you need to pass a single split (a Dataset object) instead of all the splits at once (DatasetDict):

tf_dataset = model.prepare_tf_dataset(
    dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer
)
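
Putting that together with the rest of the snippet above, the corrected tail end of the example would look roughly like this (assuming the model, tokenizer, and mapped dataset from the original post):

from tensorflow.keras.optimizers import Adam

# Build the tf.data.Dataset from a single split, not the whole DatasetDict
tf_dataset = model.prepare_tf_dataset(
    dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer
)

model.compile(optimizer=Adam(3e-5))

# Train on the prepared tf.data.Dataset, not on the raw datasets object
model.fit(tf_dataset)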

cc @Rocketknight1 for fixing this in the docs

Sorry for the delay! Thanks for pointing this out - we’ve opened a PR to fix this in the docs now: Fix TF example in quicktour by Rocketknight1 · Pull Request #22960 · huggingface/transformers · GitHub


The PyTorch version doesn’t work, either, giving a different error. Hopefully you can fix this one, too. It doesn’t inspire confidence when the “QuickStart” doesn’t actually work!