Disabling addition of CLS from BERT tokenizer

AfonsoSousa · March 9, 2022, 2:41pm

Hi. I am using the Hugging Face BERT tokenizer for a sequence-to-sequence task with an LSTM-based encoder-decoder architecture. I want the decoder input to start with the sep_token, followed by the target sentence shifted one position to the right. I built the sentence the way I wanted, but when I tokenize it, the [CLS] token is always added to the beginning. How can I disable this? Thanks in advance for any help you can provide.

christopher · March 11, 2022, 7:13am

Hello there.

The [CLS] token is added by the tokenizer’s post_processor. I believe you will have to use the HF tokenizers library to define your own. See: Input sequences — tokenizers documentation
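
A minimal sketch of the post_processor route, for anyone who wants the [SEP] prefix applied automatically. It builds a toy word-level tokenizer with a made-up four-entry vocabulary so that it runs offline. With a real fast BERT tokenizer, the same TemplateProcessing object would presumably be assigned to tokenizer.backend_tokenizer.post_processor, with tokenizer.sep_token_id as the id; that part is an assumption and is not shown here.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocabulary, made up for illustration (not BERT's real vocab).
vocab = {"[UNK]": 0, "[SEP]": 1, "hello": 2, "world": 3}

toy = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
toy.pre_tokenizer = Whitespace()

# Template: emit [SEP] first, then the input sequence ($A). No [CLS] anywhere.
toy.post_processor = TemplateProcessing(
    single="[SEP] $A",
    special_tokens=[("[SEP]", vocab["[SEP]"])],
)

encoding = toy.encode("hello world")
print(encoding.tokens)
print(encoding.ids)
```

The trade-off is that the template then applies to every call that adds special tokens, so it is a better fit when all inputs need the same prefix.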

BramVanroy · March 11, 2022, 7:43am

You should be able to pass add_special_tokens=False in the tokenizer call: tokenizer(your_input, add_special_tokens=False).

christopher · March 11, 2022, 8:19am

Wouldn’t that also disable the sep_token OP mentioned?

BramVanroy · March 11, 2022, 8:39am

It seems to me that they want a SEP token at the start of the string, and that the rest is already formatted the way they want, so no other special tokens are needed. So they should be able to do something like this (untested):

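# Tokenize without the automatic [CLS]/[SEP]; your_input is a single string here
encoded = tokenizer(your_input, add_special_tokens=False)
# Prepend the [SEP] id manually
encoded["input_ids"] = [tokenizer.sep_token_id] + encoded["input_ids"]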
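
One thing to watch with the snippet above: the tokenizer output usually also carries attention_mask and token_type_ids lists that run parallel to input_ids, so prepending only to input_ids leaves them one element short. Below is a rough sketch of a helper that keeps them in sync. The name prepend_token is hypothetical, and the dict is a hand-written stand-in for the tokenizer output on a single string, so the ids are illustrative only.

```python
def prepend_token(encoded, token_id):
    """Return a copy of a single-sequence encoding with token_id placed first."""
    out = dict(encoded)
    out["input_ids"] = [token_id] + list(encoded["input_ids"])
    # Keep the parallel lists the same length as input_ids.
    if "attention_mask" in encoded:
        out["attention_mask"] = [1] + list(encoded["attention_mask"])
    if "token_type_ids" in encoded:
        out["token_type_ids"] = [0] + list(encoded["token_type_ids"])
    return out


# Stand-in for tokenizer(your_input, add_special_tokens=False); ids are illustrative.
encoded = {
    "input_ids": [7592, 2088],
    "token_type_ids": [0, 0],
    "attention_mask": [1, 1],
}
SEP_ID = 102  # assumed value; in real code use tokenizer.sep_token_id

result = prepend_token(encoded, SEP_ID)
print(result)
```

For an LSTM decoder that consumes only input_ids this makes no difference, but it avoids length mismatches if the encoding is later padded or batched.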

AfonsoSousa · March 11, 2022, 9:39am

Thank you. That solved the issue.
