Hello.
I’m trying to train a GPT2 model (specifically TFGPT2LMHeadModel) using TensorFlow 2.
I found this post, where the author shows how to do it in great detail. However, there is one particular aspect: the way the loss function is defined when compiling the model.
Initially, when I tried to implement it on my own, I defined the loss function as usual for Keras models:
model.compile(optimizer=optimizer, loss=loss_fn, metrics=[metric])
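For reference, the other objects in that call were defined roughly as follows (the optimizer and metric shown here are just illustrative placeholders; the loss is the one mentioned further below):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')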
However, when trying to execute the fit method, it threw an error:
history = model.fit(dataset, epochs=EPOCHS)
ValueError: Shape mismatch: The shape of labels (received (490,)) should equal the shape of logits except for the last dimension (received (11760, 64))
In this case, dataset is a TensorFlow dataset whose shape is <TakeDataset shapes: ((10, 49), (10, 49)), types: (tf.int32, tf.int32)> (batch size: 10, sequence length: 49).
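For completeness, the dataset was built roughly like this (simplified sketch; the text source and block size are placeholders):

from transformers import GPT2Tokenizer
import tensorflow as tf

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# tokenize the training corpus (`text` is a placeholder for the raw training text)
token_ids = tokenizer.encode(text)

# split the ids into fixed-size blocks and build (input, label) pairs,
# where the labels are the inputs shifted by one token
block_size = 50
inputs, labels = [], []
for i in range(0, len(token_ids) - block_size + 1, block_size):
    block = token_ids[i:i + block_size]
    inputs.append(block[:-1])
    labels.append(block[1:])

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.batch(10, drop_remainder=True)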
After that, I realized the loss is passed differently in the tutorial mentioned above. There, the loss argument is a list: the actual loss function followed by a series of None values, whose number depends on the number of layers in the model architecture. With this approach, training works fine.
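If I understood the tutorial correctly, the compile call looks roughly like this (using model.config.n_layer for the number of None entries):

model.compile(
    optimizer=optimizer,
    loss=[loss_fn, *[None] * model.config.n_layer],  # apply the loss only to the first (logits) output
    metrics=[metric],
)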
In both cases, the loss function is tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True).
The documentation for TFGPT2LMHeadModel specifies: “The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).” So I think it makes sense to define the loss as that list-based argument (basically, the loss function would only apply to the top output, i.e. the language modeling head).
After seeing this, I have some questions:
- Is it correct to define the loss function that way?
- Does it have any implications for the inference process?
Additionally, I would like to mention that I’ve also tried to train the model using the TFTrainer class; unfortunately, it throws a similar error when running the train method.
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=EPOCHS,         # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=10,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)
trainer = TFTrainer(model=model, args=training_args, train_dataset=dataset, eval_dataset=dataset)
trainer.train()
ValueError: Shapes (248040,) and (4410,) are incompatible
I’m using transformers v3.5.0 and tokenizers v0.9.3.
Sorry for the long post, thanks in advance for your help!