The end of the Hugging Face Quick Tour guide has a section on how to train with TensorFlow. The code amounts to:
# Load a data set
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
# Load a classifier
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Load the corresponding tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Apply the tokenizer over the data set
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])
dataset = dataset.map(tokenize_dataset)
# Prepare the data set for use
tf_dataset = model.prepare_tf_dataset(
    dataset, batch_size=16, shuffle=True, tokenizer=tokenizer
)
from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(3e-5))
model.fit(tf_dataset)
The dataset-preparation line:
tf_dataset = model.prepare_tf_dataset(dataset, ...
throws the error:
TypeError: Dataset argument should be a datasets.Dataset!
Research Effort
The only mention of this exception anywhere on the Internet is in the Hugging Face GitHub source:
if not isinstance(dataset, datasets.Dataset):
    raise TypeError("Dataset argument should be a datasets.Dataset!")
So what am I doing wrong? Or, alternatively, what do I have to do differently?
Bonus Chatter
I tried looking at the TFAutoModelForSequenceClassification.from_pretrained documentation, which says that it will return a TFDistilBertForSequenceClassification instance.
I wanted to figure out what kind of type the method supports, but the documentation for TFDistilBertForSequenceClassification doesn't document any of the methods (e.g. no compile, fit, or prepare_tf_dataset).
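A plausible reason those methods are missing from that class's page is that they are inherited rather than redefined (compile and fit presumably come from keras.Model, and prepare_tf_dataset from a transformers base class), and the docs only list redefined methods. A generic sketch (plain Python, no transformers needed; the class names here are stand-ins) of how to locate where an inherited method is actually defined:

```python
import inspect

# Stand-in hierarchy: BaseModel plays the role of the base class
# (e.g. keras.Model) that defines the methods the subclass inherits.
class BaseModel:
    def fit(self):
        pass

class SequenceClassifier(BaseModel):
    pass

# dir() lists inherited members even when the subclass's own docs don't,
# and __qualname__ reveals which class actually defines the method.
print("fit" in dir(SequenceClassifier))     # True
print(SequenceClassifier.fit.__qualname__)  # BaseModel.fit
print(inspect.getmro(SequenceClassifier))   # the method resolution order
```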