Disabling addition of CLS from BERT tokenizer

Hi. I am using the Hugging Face BERT tokenizer for a sequence-to-sequence task with an LSTM-based encoder-decoder architecture. I want the decoder input to start with the sep_token, followed by the target sentence shifted one position to the right. I built the sentence the way I wanted, but when I tokenize it, the [CLS] token is always added to the beginning. How can I disable this? Thanks in advance for any help you can provide.

Hello there.

The [CLS] token is added by the tokenizer’s post_processor. I believe you will have to use the HF tokenizers library to define your own post-processor. See: Input sequences (tokenizers documentation)
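As a rough illustration of what a custom post-processor could look like, here is a minimal sketch using TemplateProcessing from the tokenizers library. The toy vocabulary, its ids, and the helper name build_toy_tokenizer are made up for the example so that it runs without downloading a model; with a real fast BERT tokenizer you would presumably assign the same kind of post-processor to tokenizer.backend_tokenizer.post_processor and use tokenizer.sep_token_id instead of the hard-coded id.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing


def build_toy_tokenizer() -> Tokenizer:
    """Build a tiny WordPiece tokenizer that prepends [SEP] and nothing else.

    The vocabulary below is hypothetical and exists only for this sketch.
    """
    vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
    tok = Tokenizer(WordPiece(vocab=vocab, unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    # Template: a single [SEP] in front of the sequence, no [CLS], no trailing [SEP].
    tok.post_processor = TemplateProcessing(
        single="[SEP] $A",
        special_tokens=[("[SEP]", vocab["[SEP]"])],
    )
    return tok


toy = build_toy_tokenizer()
ids = toy.encode("hello world").ids
print(ids)
```

This changes what the tokenizer adds by default, so every later call is affected. If only one input needs the special layout, handling it at call time is probably simpler.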

You should be able to pass add_special_tokens=False in the tokenizer call: tokenizer(your_input, add_special_tokens=False).

Wouldn’t that also disable the sep_token OP mentioned?

It seems to me that they want a SEP token at the start of the string, and that the rest is already formatted the way they want, so no other special tokens are needed. So they should be able to do something like this (untested):

encoded = tokenizer(your_input, add_special_tokens=False)
encoded["input_ids"] = [tokenizer.sep_token_id] + encoded["input_ids"]
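One caveat with the snippet above: only input_ids is extended, so if you also use attention_mask or token_type_ids from the same output, they end up one element shorter than input_ids. Below is a small sketch of a helper that keeps them aligned. The name prepend_token is hypothetical, it assumes a single (non-batched) encoding with plain Python lists, and the ids in the example are stand-ins for real tokenizer output.

```python
from typing import Dict, List, Mapping


def prepend_token(encoded: Mapping[str, List[int]], token_id: int) -> Dict[str, List[int]]:
    """Return a copy of a single encoding with token_id prepended to input_ids.

    attention_mask and token_type_ids, when present, get a matching
    leading entry so all lists keep the same length.
    """
    out = {key: list(values) for key, values in encoded.items()}
    out["input_ids"] = [token_id] + out["input_ids"]
    if "attention_mask" in out:
        out["attention_mask"] = [1] + out["attention_mask"]  # the new token is attended to
    if "token_type_ids" in out:
        out["token_type_ids"] = [0] + out["token_type_ids"]  # first segment
    return out


# Stand-in for tokenizer(your_input, add_special_tokens=False); ids are illustrative.
example = {"input_ids": [7592, 2088], "token_type_ids": [0, 0], "attention_mask": [1, 1]}
# 102 stands in for tokenizer.sep_token_id here.
result = prepend_token(example, 102)
print(result)
```

With a real tokenizer the call would look like prepend_token(tokenizer(your_input, add_special_tokens=False), tokenizer.sep_token_id). For an LSTM decoder that only consumes input_ids, the original two-line version is enough.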

Thank you. That solved the issue.