I’m trying to train a GPT-2 model (specifically TFGPT2LMHeadModel) using TensorFlow 2.
I found this post where the author shows how to do it in great detail. However, there’s one unusual aspect: the way the loss function is defined when compiling the model.
Initially, when I was trying to implement it on my own, I defined the loss function the usual way for Keras models:
```python
model.compile(optimizer=optimizer, loss=loss_fn, metrics=[metric])
```
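For reference, `loss_fn`, `optimizer`, and `metric` were nothing exotic; roughly along these lines (simplified for the post, the exact optimizer and metric shouldn’t matter for the question):

```python
import tensorflow as tf

# Roughly how loss_fn, optimizer and metric were defined (simplified):
# the loss is computed directly on the raw logits, hence from_logits=True.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
```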
However, when trying to execute the fit method, it threw an error:

```python
history = model.fit(dataset, epochs=EPOCHS)
```

```
ValueError: Shape mismatch: The shape of labels (received (490,)) should equal the shape of logits except for the last dimension (received (11760, 64))
```
In this case, the dataset is a TensorFlow dataset whose shape is

```
<TakeDataset shapes: ((10, 49), (10, 49)), types: (tf.int32, tf.int32)>
```

(batch size: 10, sequence length: 49).
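Just to illustrate that structure (tokenization omitted; `input_ids` and `labels` below are placeholder arrays, not my actual preprocessing code), a dataset with these shapes can be built like this:

```python
import numpy as np
import tensorflow as tf

# Placeholder arrays only to reproduce the shapes above: each example is a
# sequence of 49 token ids; labels mirror the inputs here just for the sketch.
input_ids = np.random.randint(0, 50257, size=(1000, 49), dtype=np.int32)
labels = input_ids.copy()

# drop_remainder=True gives fixed (10, 49) shapes per batch.
dataset = tf.data.Dataset.from_tensor_slices((input_ids, labels)).batch(10, drop_remainder=True)
```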
After that, I realized the loss parameter is defined differently in the tutorial mentioned above: it is a list whose first element is the actual loss function, followed by a number of None values that depends on the number of layers in the model architecture. With this approach, training works fine.
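Concretely, the compile call under that approach looks roughly like this (my reading of the tutorial, not a verbatim copy): the real loss is attached to the first model output (the LM logits), and the None entries tell Keras to ignore the model’s extra outputs (the per-layer past key/value tensors).

```python
# Loss only for the first output (the logits); None for the remaining outputs,
# one per transformer layer (model.config.n_layer == 12 for base GPT-2).
model.compile(
    optimizer=optimizer,
    loss=[loss_fn, *[None] * model.config.n_layer],
    metrics=[metric],
)
```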
In both cases, the loss function itself is the same; only the way it is passed to compile differs.
The documentation for TFGPT2LMHeadModel specifies: “The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).” So I think it makes sense to define the loss as that list-based parameter (basically, the loss function only applies to the top output, the LM logits).
After seeing this, I have some questions:
- Is it correct to define the loss function that way?
- Does it have any implications for the inference process?
Additionally, I would like to mention that I’ve also tried to train the model using the TFTrainer class; unfortunately, it throws a similar error when running the train method:
```python
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=EPOCHS,         # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=10,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,
)

trainer.train()
```

```
ValueError: Shapes (248040,) and (4410,) are incompatible
```
I’m using transformers v3.5.0 and tokenizers v0.9.3.
Sorry for the long post, thanks in advance for your help!