Hello,
I am currently working on a classification problem using ProtBERT and I am following the Fine-Tuning Tutorial. I have called the tokenizer as described there.
Unfortunately, the model doesn’t seem to be learning (I froze the BERT layers). From reading around, I saw that I need to add the [CLS] token, and found an option for that:
tokenizer.encode(text, add_special_tokens=True)
Yet the tutorial I am following doesn’t seem to require this, so I was wondering why there is a discrepancy, and whether this might be why my model isn’t learning.
You can also see it by encoding a text and then decoding it:
text = "I love Adelaide!"
# add_special_tokens=True is set by default
text_enc = tokenizer.encode(text)
for tok in text_enc:
    print(tok, tokenizer.decode(tok))
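The same sanity check works on the ProtBERT side. Here is a minimal sketch (the Rostlab/prot_bert checkpoint and the example sequence are assumptions, not taken from your tutorial):

from transformers import BertTokenizer

# ProtBERT expects amino acids separated by single spaces (assuming the Rostlab/prot_bert checkpoint)
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

seq = "M K T A Y I A K Q R"  # hypothetical protein sequence
ids = tokenizer.encode(seq)  # add_special_tokens=True is the default here too
print(tokenizer.convert_ids_to_tokens(ids))
# if special tokens are being added, the output should start with [CLS] and end with [SEP]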
Also, I am freezing the BERT layers but fine-tuning the classification layer. The reason I am doing that is that Colab crashes as it runs out of GPU memory if I don’t. If I try using just the CPU, I get “Your session crashed after using all available RAM.”
I therefore add
for param in model.bert.parameters():
    param.requires_grad = False
This fixes the memory issue, but then the model doesn’t seem to learn anything.
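For completeness, here is a minimal sketch of the check I would use to confirm that only the classification head stays trainable (this assumes a BertForSequenceClassification-style model where the encoder sits under model.bert):

# freeze every encoder parameter; only the classification head stays trainable
for param in model.bert.parameters():
    param.requires_grad = False

# list what is still trainable; this should show only the classifier parameters,
# e.g. classifier.weight and classifier.bias
print([name for name, p in model.named_parameters() if p.requires_grad])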
Thank you @lewtun. I have gone through their implementation and noticed that their per_device_train_batch_size is 1. Once I changed that, it works and I am actually getting some amazing results (it’s on its final epoch now, but currently hitting 80% accuracy).
I was hoping you could perhaps explain to me what per_device_train_batch_size does and why it was the issue?
the per_device_train_batch_size specifies the batch size for the device you are training on (e.g. GPU/TPU/CPU), so if your training set has 1,000 examples and per_device_train_batch_size=1, then it will take 1,000 steps to complete one epoch.
by increasing the value of per_device_train_batch_size you are able to train faster since it takes fewer steps to complete each epoch (e.g. if per_device_train_batch_size=4 then we only need 250 steps per epoch in our example), but this can sometimes lead to worse performance since the gradients are averaged / summed over a minibatch.
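as a concrete sketch, this is where the value gets set (the output_dir and epoch count below are just placeholders):

from transformers import TrainingArguments

# with 1,000 training examples, per_device_train_batch_size=4 gives 1,000 / 4 = 250 steps per epoch
training_args = TrainingArguments(
    output_dir="protbert-clf",      # placeholder output directory
    per_device_train_batch_size=4,
    num_train_epochs=3,             # placeholder
)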
in your case, my guess is that with per_device_train_batch_size=1 you need to train for a very long time to see the model learn anything.
oh, what i meant by “long” is that you may need to run for many epochs before you start seeing any flattening out of your training / validation loss (i.e. “convergence”). if you saw the validation loss drop during those 3 epochs then i am not sure what else might have gone wrong in your ProtBERT example