Fine-tuning T5 with TensorFlow

Hi NLP Gurus,

I recently went through the brand new Hugging Face course and decided to pick a project from the project list: Personal Writing Assistant. In this project, Lewis proposes using T5 and the JFleg dataset. I struggled a lot to get something close to working, but I’m blocked at the training stage. Important point: I’m working on an M1 Mac, so I have to use TensorFlow.

First issue: to_tf_dataset coupled with DataCollatorForSeq2Seq has a strange behaviour. DataCollatorForSeq2Seq should use the T5 model to create decoder_input_ids by calling prepare_decoder_input_ids_from_labels on the labels. But because that column doesn’t exist at first, to_tf_dataset drops it. If I add it to the columns parameter of to_tf_dataset, an error is raised because the column doesn’t exist yet. I finally ended up creating a dummy column filled with zeros to make it work (sketched below, and used in the full notebook). I think we can improve the developer experience here. Note that the course example has the same issue on Google Colab: batch["decoder_input_ids"] shows the tensor, but it doesn’t appear in tf_train_dataset.
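
To illustrate, this is roughly what I tried before the dummy-column trick, using the same names as in the notebook below (I may be misreading the intent of to_tf_dataset here):

# Without listing it, the decoder_input_ids that DataCollatorForSeq2Seq creates
# at batch time is simply dropped from the resulting tf.data.Dataset...
tf_train = inputs.to_tf_dataset(
  columns=["attention_mask", "input_ids"],
  label_cols=["labels"],
  collate_fn=data_collator,
  batch_size=8,
)

# ...while listing it up front raises an error, because the column only gets
# created later, at batch time, via prepare_decoder_input_ids_from_labels.
tf_train = inputs.to_tf_dataset(
  columns=["attention_mask", "input_ids", "decoder_input_ids"],
  label_cols=["labels"],
  collate_fn=data_collator,
  batch_size=8,
)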

Second (and blocking) issue: when I call the fit method on the Keras model, an error is raised:

Invalid argument:  logits and labels must have the same first dimension, got logits shape [3840,64] and labels shape [480]
	 [[node sparse_categorical_crossentropy_3/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at opt/homebrew/Caskroom/miniforge/base/envs/tensorflow/lib/python3.9/site-packages/transformers/modeling_tf_utils.py:797) ]]
	 [[tf_t5for_conditional_generation/decoder/block_._2/layer_._0/SelfAttention/transpose_1/_514]]

I personally can’t make sense of this kind of error, so if someone can help, I would really appreciate it :hugs:!

This is my notebook:

import tensorflow as tf
import numpy as np
from datasets import load_dataset, concatenate_datasets, Dataset
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, create_optimizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')

dataset = load_dataset('jfleg')
# JFleg only ships validation and test splits, so concatenate both for training data
dataset = concatenate_datasets([dataset['validation'], dataset['test']])
dataset = dataset.filter(lambda x: len(x['sentence']) > 16)

pd_dataset = dataset.to_pandas()
# Each sentence comes with several reference corrections: explode to one row per correction
pd_dataset = pd_dataset.explode('corrections', ignore_index=True)
dataset = Dataset.from_pandas(pd_dataset)

# Keep a single 'correction' column and prefix the input with the task ('grammar:')
dataset = dataset.map(lambda x: {'correction': x['corrections'], 'sentence': 'grammar:' + x['sentence']})
dataset = dataset.remove_columns(['corrections'])

def preprocess(examples):
  model_inputs = tokenizer(examples['sentence'], max_length=128, truncation=True)

  with tokenizer.as_target_tokenizer():
    labels = tokenizer(examples['correction'], max_length=128, truncation=True)

  model_inputs['labels'] = labels['input_ids']
  # Dummy, zero-width decoder_input_ids so to_tf_dataset keeps the column;
  # DataCollatorForSeq2Seq (with model= set) overwrites it from the labels at batch time
  model_inputs['decoder_input_ids'] = np.zeros((len(labels['input_ids']), 0))
  return model_inputs

inputs = dataset.map(preprocess, batched=True)
inputs = inputs.remove_columns(['sentence', 'correction'])

model = TFAutoModelForSeq2SeqLM.from_pretrained('t5-small')

batch_size = 8
num_epochs = 3

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")

tf_train = inputs.to_tf_dataset(
  columns=["attention_mask", "input_ids", 'decoder_input_ids'],
  label_cols=["labels"],
  shuffle=True,
  collate_fn=data_collator,
  batch_size=batch_size,
)

num_train_steps = len(tf_train) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(
  optimizer=optimizer,
  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
  metrics=tf.metrics.SparseCategoricalAccuracy(),
)

# tf_train is already batched by to_tf_dataset, so no batch_size here
model.fit(
  tf_train,
  epochs=num_epochs,
)

I already did a lot of research and found these:

  1. This issue, but unfortunately without a real answer
  2. And this one, same story :confused:
  3. A Snapchat notebook
  4. And this library, which seems to use Hugging Face and TensorFlow behind the scenes

cc @Rocketknight1

Hi @mazerte, sorry for the delay in replying! This is one of those cases where I’d actually recommend trying our new “internal loss” method. For more complex models like Seq2Seq, getting the right Keras losses can be hard - it’s possible, but it can require a lot of knowledge and some hacky code. Instead, just let our model compute loss for you! To do that you should do two things:

  1. Move the labels to the input dictionary so that they’re visible to the model on the forward pass, like so:
tf_train = inputs.to_tf_dataset(
  columns=["attention_mask", "input_ids", 'decoder_input_ids', 'labels'],
  shuffle=True,
  collate_fn=data_collator,
  batch_size=batch_size,
)
  2. Remove the loss argument from compile(). Note that right now we don’t support Keras metrics when using the internal loss, but this is an area of very active development - that will hopefully change soon!
model.compile(
  optimizer=optimizer
)

If you make these two changes, your model should train successfully. We recommend this method whenever you’re not sure of which loss to use.
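
Putting the two changes together, the end of your notebook would look roughly like this (a sketch reusing the inputs, data_collator, optimizer and model you already defined):

tf_train = inputs.to_tf_dataset(
  columns=["attention_mask", "input_ids", "decoder_input_ids", "labels"],
  shuffle=True,
  collate_fn=data_collator,
  batch_size=batch_size,
)

# No loss in compile(): the model computes its own loss from the labels it now
# sees in the input dict on the forward pass
model.compile(optimizer=optimizer)

model.fit(
  tf_train,
  epochs=num_epochs,
)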