Hi NLP Gurus,
I recently went through the brand new Hugging Face course and decided to pick a project from the project list: Personal Writing Assistant. In this project, Lewis proposes to use T5 and the JFleg dataset. I struggled a lot to get something close to working, but I'm blocked at the training stage. Important point: I'm working on an M1 Mac, so I have to use TensorFlow.
First issue: to_tf_dataset coupled with DataCollatorForSeq2Seq has a strange behaviour. DataCollatorForSeq2Seq should use the T5 model to create decoder_input_ids by calling prepare_decoder_input_ids_from_labels on the labels. But because that column doesn't exist in the dataset at first, to_tf_dataset drops it. If I add it to the columns param of to_tf_dataset, an error is raised because the column doesn't exist yet. I finally ended up creating a dummy column filled with zeros to make it work. I think we can improve the developer experience here. Note that the course example has the same issue on Google Colab: batch["decoder_input_ids"] shows the tensor, but it doesn't appear in tf_train_dataset.
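For reference, a minimal way to see this (reusing the tokenizer, model, data_collator and the tokenized inputs dataset defined in the notebook below) is to call the collator directly on a few examples: decoder_input_ids does come back from the collator, it only disappears from the to_tf_dataset output.

# Sanity check: the collator alone does build decoder_input_ids from the labels
features = [{k: inputs[i][k] for k in ('input_ids', 'attention_mask', 'labels')} for i in range(4)]
batch = data_collator(features)
print(batch.keys())  # 'decoder_input_ids' is present here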
Second blocking issue: when I call the fit method on the Keras model, an error is raised:
Invalid argument: logits and labels must have the same first dimension, got logits shape [3840,64] and labels shape [480]
[[node sparse_categorical_crossentropy_3/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at opt/homebrew/Caskroom/miniforge/base/envs/tensorflow/lib/python3.9/site-packages/transformers/modeling_tf_utils.py:797) ]]
[[tf_t5for_conditional_generation/decoder/block_._2/layer_._0/SelfAttention/transpose_1/_514]]
I personally can't make sense of this kind of error, so if someone can help, I would appreciate it!
This is my notebook:
import tensorflow as tf
import numpy as np
from datasets import load_dataset, concatenate_datasets, Dataset
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, create_optimizer
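# JFleg only ships validation and test splits, so merge them and drop very short sentences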
dataset = load_dataset('jfleg')
dataset = concatenate_datasets([dataset['validation'], dataset['test']])
dataset = dataset.filter(lambda x: len(x['sentence']) > 16)
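# Each sentence comes with several human corrections stored as a list: explode to one (sentence, correction) pair per row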
pd_dataset = dataset.to_pandas()
pd_dataset = pd_dataset.explode('corrections', ignore_index=True)
dataset = Dataset.from_pandas(pd_dataset)
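# Prepend the task prefix for T5 and keep a single 'correction' target column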
dataset = dataset.map(lambda x: {'correction': x['corrections'], 'sentence': 'grammar:' + x['sentence']})
dataset = dataset.remove_columns(['corrections'])
tokenizer = AutoTokenizer.from_pretrained('t5-small')

def preprocess(examples):
    model_inputs = tokenizer(examples['sentence'], max_length=128, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples['correction'], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    # Workaround: dummy decoder_input_ids column so to_tf_dataset keeps it;
    # DataCollatorForSeq2Seq replaces it with the real values built from the labels
    model_inputs['decoder_input_ids'] = np.zeros((len(labels['input_ids']), 0))
    return model_inputs
inputs = dataset.map(preprocess, batched=True)
inputs = inputs.remove_columns(['sentence', 'correction'])
model = TFAutoModelForSeq2SeqLM.from_pretrained('t5-small')
batch_size = 8
num_epochs = 3
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
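# decoder_input_ids must be listed in columns, otherwise to_tf_dataset drops it (hence the dummy column in preprocess)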
tf_train = inputs.to_tf_dataset(
    columns=["attention_mask", "input_ids", "decoder_input_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
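# AdamW with weight decay and a linearly decaying learning rate over all training steps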
num_train_steps = len(tf_train) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
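# Training fails here with the logits/labels shape mismatch reported above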
model.fit(
    tf_train,
    epochs=num_epochs,
    batch_size=batch_size,
)