The end of the Hugging Face Quick Tour guide has a section on how to Train with TensorFlow. The code amounts to:
# Load a data set
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")

# Load a classifier
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Load the corresponding tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Apply the tokenizer over the data set
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])
dataset = dataset.map(tokenize_dataset)

# Prepare the data set for use
tf_dataset = model.prepare_tf_dataset(
    dataset, batch_size=16, shuffle=True, tokenizer=tokenizer
)

# Compile and train
from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(3e-5))
model.fit(tf_dataset)
The line that prepares the dataset:
tf_dataset = model.prepare_tf_dataset(dataset, ...
throws the error:
TypeError: Dataset argument should be a datasets.Dataset!
Research Effort
The only mention of this exception anywhere on the Internet is in the Hugging Face GitHub source:
if not isinstance(dataset, datasets.Dataset):
raise TypeError("Dataset argument should be a datasets.Dataset!")
So what am I doing wrong? Or, alternatively, what do I have to do differently?
Bonus Chatter
I tried looking at the TFAutoModelForSequenceClassification.from_pretrained documentation, which says that it will return a TFDistilBertForSequenceClassification.
I wanted to figure out what type the method expects, but the documentation for TFDistilBertForSequenceClassification doesn't document its methods at all (e.g., there is nothing for compile, fit, or prepare_tf_dataset).