Fine-tuning T5 on TensorFlow

Hi NLP Gurus,

I recently went through the brand new Hugging Face course and decided to pick a project from the project list: Personal Writing Assistant. For this project, Lewis proposes using T5 and the JFLEG dataset. I struggled a lot to get something close to working, but I’m blocked at the training stage. Important point: I’m working on an M1 Mac, so I must use TensorFlow.

First issue: to_tf_dataset combined with DataCollatorForSeq2Seq behaves strangely. DataCollatorForSeq2Seq should use the T5 model to create decoder_input_ids by calling the model’s prepare_decoder_input_ids_from_labels method on the labels. But because that column doesn’t exist in the dataset at first, to_tf_dataset drops it. If I add it to the columns parameter of to_tf_dataset, an error is raised because the column doesn’t exist yet. I finally ended up creating a dummy column filled with zeros to make it work. I think the developer experience could be improved here. Note that the course example has the same issue on Google Colab: batch["decoder_input_ids"] shows the tensor, but it doesn’t appear in tf_train_dataset.
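
To make concrete what I expect the collator to produce, here is a rough NumPy sketch of the "shift right" step that turns labels into decoder_input_ids. It is not the library's code: the function name shift_labels_right is made up for illustration, and it assumes the T5 convention that the decoder start token and the pad token both have id 0 and that padded label positions are marked with -100. The token ids in the example are arbitrary.

```python
import numpy as np

def shift_labels_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Hypothetical helper: build decoder inputs by shifting labels one step right.

    Assumes T5-style defaults (start token id 0, pad token id 0) and that
    padded label positions are marked with -100.
    """
    labels = np.asarray(labels)
    shifted = np.empty_like(labels)
    # The decoder always starts from the start token...
    shifted[:, 0] = decoder_start_token_id
    # ...and then sees the previous target token at every later position.
    shifted[:, 1:] = labels[:, :-1]
    # -100 is only meaningful to the loss; the decoder needs a real token id.
    shifted[shifted == -100] = pad_token_id
    return shifted

# Two toy label rows; the second one is padded with -100.
labels = np.array([[37, 1712, 19, 1],
                   [8774, 1, -100, -100]])
decoder_input_ids = shift_labels_right(labels)
print(decoder_input_ids)
```

Since this column is derived from the labels at collation time, it cannot exist in the dataset beforehand, which is why the columns argument rejects it.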

Second, blocking issue: when I call the fit method on the Keras model, an error is raised:

Invalid argument:  logits and labels must have the same first dimension, got logits shape [3840,64] and labels shape [480]
	 [[node sparse_categorical_crossentropy_3/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at opt/homebrew/Caskroom/miniforge/base/envs/tensorflow/lib/python3.9/site-packages/transformers/ ]]

I can’t work out this kind of error on my own, so if someone can help, I would appreciate it :hugs: !

This is my notebook:

import tensorflow as tf
import numpy as np
from datasets import load_dataset, concatenate_datasets, Dataset
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, create_optimizer

# Assumption: the tokenizer definition is missing from the post; the 't5-small'
# checkpoint is used here because it matches the model loaded further down.
tokenizer = AutoTokenizer.from_pretrained('t5-small')

dataset = load_dataset('jfleg')
dataset = concatenate_datasets([dataset['validation'], dataset['test']])
dataset = dataset.filter(lambda x: len(x['sentence']) > 16)

pd_dataset = dataset.to_pandas()
pd_dataset = pd_dataset.explode('corrections', ignore_index=True)
dataset = Dataset.from_pandas(pd_dataset)

dataset = dataset.map(lambda x: {'correction': x['corrections'], 'sentence': 'grammar:' + x['sentence']})
dataset = dataset.remove_columns(['corrections'])

def preprocess(examples):
  model_inputs = tokenizer(examples['sentence'], max_length=128, truncation=True)

  with tokenizer.as_target_tokenizer():
    labels = tokenizer(examples['correction'], max_length=128, truncation=True)

  model_inputs['labels'] = labels['input_ids']
  model_inputs['decoder_input_ids'] = np.zeros((len(labels['input_ids']), 0))
  return model_inputs

inputs = dataset.map(preprocess, batched=True)
inputs = inputs.remove_columns(['sentence', 'correction'])

model = TFAutoModelForSeq2SeqLM.from_pretrained('t5-small')

batch_size = 8
num_epochs = 3

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")

tf_train = inputs.to_tf_dataset(
  columns=["attention_mask", "input_ids", 'decoder_input_ids'],

num_train_steps = len(tf_train) * num_epochs

optimizer, schedule = create_optimizer(

I have already done a lot of research and found these:

  1. This issue, but unfortunately without a real answer
  2. And this one, same :confused:
  3. A Snapchat notebook
  4. And this library, which seems to use Hugging Face and TensorFlow behind the scenes

cc @Rocketknight1

Hi @mazerte, sorry for the delay in replying! This is one of those cases where I’d actually recommend trying our new “internal loss” method. For more complex models like Seq2Seq, getting the right Keras losses can be hard - it’s possible, but it can require a lot of knowledge and some hacky code. Instead, just let our model compute loss for you! To do that you should do two things:

  1. Move the labels to the input dictionary so that they’re visible to the model on the forward pass, like so:
tf_train = inputs.to_tf_dataset(
  columns=["attention_mask", "input_ids", 'decoder_input_ids', 'labels'],
  2. Remove the loss argument to compile(). Note that right now, we don’t support Keras metrics when using the internal loss, but this is an area of very active development - that will hopefully change soon!

If you make these two changes, your model should train successfully. We recommend this method whenever you’re not sure of which loss to use.
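
To give a feel for what a label-aware loss has to do, here is a minimal NumPy sketch of a masked sparse cross-entropy. It is only an illustration under stated assumptions, not the code the library runs: the function name masked_sparse_crossentropy is made up, padded label positions are assumed to be marked with -100, and the vocabulary size of 100 and sequence length of 60 are toy values. The key point is that logits of shape [batch, seq_len, vocab] and labels of shape [batch, seq_len] are flattened to the same first dimension before the loss is taken, which is exactly the condition the error message complains about.

```python
import numpy as np

def masked_sparse_crossentropy(logits, labels, ignore_index=-100):
    """Hypothetical helper: mean token-level cross-entropy, skipping ignored labels."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    vocab_size = logits.shape[-1]

    # Flatten [batch, seq_len, vocab] -> [batch * seq_len, vocab]
    # and [batch, seq_len] -> [batch * seq_len].
    flat_logits = logits.reshape(-1, vocab_size)
    flat_labels = labels.reshape(-1)
    if flat_logits.shape[0] != flat_labels.shape[0]:
        raise ValueError(
            f"logits and labels must have the same first dimension, "
            f"got {flat_logits.shape} and {flat_labels.shape}"
        )

    # Drop positions that only exist because of padding.
    keep = flat_labels != ignore_index
    kept_logits = flat_logits[keep]
    kept_labels = flat_labels[keep]

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = kept_logits - kept_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-probability of the correct token at each kept position.
    token_losses = -log_probs[np.arange(kept_labels.shape[0]), kept_labels]
    return token_losses.mean()

# Toy batch: 8 sequences of 60 positions gives 8 * 60 = 480 flattened labels.
rng = np.random.default_rng(0)
batch_size, seq_len, vocab_size = 8, 60, 100  # toy sizes, not T5's real vocabulary
logits = rng.normal(size=(batch_size, seq_len, vocab_size))
labels = rng.integers(0, vocab_size, size=(batch_size, seq_len))
labels[:, 40:] = -100  # pretend the last 20 positions of every row are padding
loss = masked_sparse_crossentropy(logits, labels)
print(loss)
```

With a batch size of 8, a flattened label count of 480 would correspond to 60 positions per sequence, which is why those toy sizes are used here; that sequence length is inferred from the error message, not stated in the notebook. A last dimension of 64 on the "logits" side is far smaller than any realistic vocabulary, which suggests the Keras loss was being applied to a model output other than the logits.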