The end of the Hugging Face Quick Tour guide has a section on how to train with TensorFlow. The code amounts to:
# Load a data set
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
# Load a classifier
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Load the corresponding tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Apply the tokenizer over the data set
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])
dataset = dataset.map(tokenize_dataset)
# Prepare the data set for use
tf_dataset = model.prepare_tf_dataset(
    dataset, batch_size=16, shuffle=True, tokenizer=tokenizer
)
from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(3e-5))
model.fit(tf_dataset)
The dataset-preparation line:
tf_dataset = model.prepare_tf_dataset(dataset, ...
throws the error:
TypeError: Dataset argument should be a datasets.Dataset!
Research Effort
The only mention of this exception anywhere on the Internet is in the Hugging Face GitHub source:
if not isinstance(dataset, datasets.Dataset):
    raise TypeError("Dataset argument should be a datasets.Dataset!")
So what am I doing wrong? Or, alternatively, what do I have to do differently?
Bonus Chatter
I tried looking at the TFAutoModelForSequenceClassification.from_pretrained documentation, which says that it will return a TFDistilBertForSequenceClassification instance.
I wanted to figure out what kind of type the method supports, but the documentation for TFDistilBertForSequenceClassification doesn't document any of the methods (e.g. no compile, fit, or prepare_tf_dataset).
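A plausible reason those methods are missing from that class's page is that they are inherited rather than redefined (compile and fit presumably come from keras.Model, and prepare_tf_dataset from a transformers base class), and the docs only list redefined methods. A generic sketch (plain Python, no transformers needed; the class names here are stand-ins) of how to locate where an inherited method is actually defined:

```python
import inspect

# Stand-in hierarchy: BaseModel plays the role of the base class
# (e.g. keras.Model) that defines the methods the subclass inherits.
class BaseModel:
    def fit(self):
        pass

class SequenceClassifier(BaseModel):
    pass

# dir() lists inherited members even when the subclass's own docs don't,
# and __qualname__ reveals which class actually defines the method.
print("fit" in dir(SequenceClassifier))     # True
print(SequenceClassifier.fit.__qualname__)  # BaseModel.fit
print(inspect.getmro(SequenceClassifier))   # the method resolution order
```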