Does AutoTokenizer.from_pretrained add [CLS] tokens?

Hello,
I am currently working on a classification problem using ProtBERT and I am following the Fine-Tuning Tutorial. I loaded the tokenizer using

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")

and then tokenised the sequences as the tutorial shows

train_encodings = tokenizer(seq_train, truncation=True, padding=True,
                            max_length=1024, return_tensors="pt")

Unfortunately, the model doesn’t seem to be learning (I froze the BERT layers). From reading around, I saw that I need to add the [CLS] token and found such an option using

tokenizer.encode(sequence, add_special_tokens=True)

Yet the tutorial I am following doesn’t seem to require this, and I was wondering why there is a discrepancy, and whether that might be why my model isn’t learning.

Thank you

Hi @theudster, I’m pretty sure that ProtBERT has a CLS token since you can see it in the tokenizer’s special tokens map:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
tokenizer.special_tokens_map
# {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

You can also see it by encoding a text and then decoding it:

text = "I love Adelaide!"
# add_special_tokens=True is set by default
text_enc = tokenizer.encode(text)

for tok in text_enc:
    print(tok, tokenizer.decode(tok))
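
If it helps, here is a minimal check (reusing the tokenizer and text_enc from above) that the encoded sequence really starts and ends with the special tokens:

# Check that the special tokens were added automatically
# (uses `tokenizer` and `text_enc` from the snippet above).
assert text_enc[0] == tokenizer.cls_token_id   # sequence starts with [CLS]
assert text_enc[-1] == tokenizer.sep_token_id  # sequence ends with [SEP]
print(tokenizer.cls_token_id, tokenizer.sep_token_id)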

You say you froze the BERT layers, so I’m wondering how you’re doing fine-tuning? I’ve sometimes found that the tutorials in the docs aren’t always complete, so for fine-tuning with text classification I would recommend following Sylvain’s tutorial here: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb

Thank you @lewtun.

So are you saying that the tokenizer adds [CLS]?

Also, I am freezing the BERT layers but fine-tuning the classification layer. The reason I am doing that is that Colab runs out of GPU memory and crashes if I don’t. If I try using just the CPU, I get “Your session crashed after using all available RAM.”
I therefore add

for param in model.bert.parameters():
    param.requires_grad = False

which fixes that, but then the model doesn’t seem to learn anything :pensive:
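
For reference, this is roughly how I check that only the classifier head stays trainable after the freeze (a quick sketch, assuming model is the ProtBERT classification model):

# Count trainable vs. frozen parameters after freezing the encoder.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable: {trainable:,} | frozen: {frozen:,}")

# List which sub-modules still receive gradients; only the classifier
# head should show up here if the freeze worked as intended.
for name, p in model.named_parameters():
    if p.requires_grad:
        print(name)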

Yes, the tokenizers in transformers add the special tokens by default (see the docs here).

I’m not familiar with ProtBERT but I’m surprised it’s crashing Colab because the repo has some Colab examples: ProtTrans/ProtBert-BFD-FineTuning-MS.ipynb at master · agemagician/ProtTrans · GitHub

If you’re still having problems fine-tuning with a GPU, perhaps you can reduce the batch size to avoid the OOM errors.
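
For example, something along these lines, where the exact values are just placeholders rather than a recommendation; a smaller per-device batch combined with gradient accumulation keeps the effective batch size up:

from transformers import TrainingArguments

# Smaller per-device batch size to fit in Colab's GPU memory;
# gradient accumulation keeps the effective batch size larger.
training_args = TrainingArguments(
    output_dir="./results",           # placeholder path
    per_device_train_batch_size=2,    # reduce further if you still hit OOM
    gradient_accumulation_steps=8,    # effective batch size = 2 * 8 = 16
    num_train_epochs=3,
    fp16=True,                        # mixed precision also saves GPU memory
)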

Thank you @lewtun. I have gone through their implementation and noticed that their per_device_train_batch_size is 1. Once I changed that, it works and I am actually getting some amazing results (it’s on its final epoch now, but currently hitting 80% accuracy :grinning:).

I was hoping you could explain what per_device_train_batch_size does and why that was the issue?

the per_device_train_batch_size specifies the batch size for the device you are training on (e.g. GPU/TPU/CPU), so if your training set has 1,000 examples and per_device_train_batch_size=1 then it will take 1,000 steps to complete one epoch.

by increasing the value of per_device_train_batch_size you are able to train faster since it takes fewer steps to complete each epoch (e.g. if per_device_train_batch_size=4 then we only need 250 steps / epoch in our example), but this can sometimes lead to worse performance since the gradients are averaged / summed in a minibatch.
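
concretely, the steps-per-epoch arithmetic is just:

import math

num_examples = 1000                # training set size from the example above
per_device_train_batch_size = 4    # batch size per device

# each optimisation step consumes one batch, so one epoch needs:
steps_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
print(steps_per_epoch)  # 250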

in your case, my guess is that with per_device_train_batch_size=1 you need to train for a very long time to see the model learn anything.

well, on Colab, it took a bit over an hour to train 3 epochs, not sure if that is long or not

oh what i meant by “long” is that you may need to run for many epochs before you start seeing any flattening out of your training / validation loss (i.e. “convergence”). if you saw the validation loss drop during those 3 epochs then i am not sure what else might have gone wrong in your ProtBERT example.