Hi. I am using the BERT Huggingface tokenizer for a sequence-to-sequence task with an LSTM-based Encoder-Decoder architecture. I want the input to the decoder to start with the sep_token, followed by the target sentence shifted one position to the right. I built the sentence as I wanted, but when I tokenize it, the [CLS] token is always added to the beginning of the sentence. How can I disable this addition? Thanks in advance for any help you can provide.
Hello there.
The [CLS] token is added by the tokenizer’s post_processor. I believe you will have to use the HF tokenizers library to define your own. See: Input sequences — tokenizers documentation
You should be able to pass add_special_tokens=False in the tokenizer call: tokenizer(your_input, add_special_tokens=False)
Wouldn’t that also disable the sep_token OP mentioned?
It seems to me that they want to add a SEP token at the start of the string, but that the other parts are already formatted as they want/do not need other special tokens. So they should be able to do something like this (untested):
encoded = tokenizer(your_input, add_special_tokens=False)
encoded["input_ids"] = [tokenizer.sep_token_id] + encoded["input_ids"]
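One thing to watch with the snippet above: the tokenizer output usually also carries attention_mask and token_type_ids, and those stay one element shorter than input_ids if only the ids are extended. Here is a minimal sketch of a helper that keeps them aligned. It assumes a single (non-batched) sequence; prepend_sep is a hypothetical name, not part of transformers, and the ids in the demo are illustrative values in a plain dict standing in for the tokenizer output.

```python
def prepend_sep(encoded, sep_token_id):
    """Return a copy of a single-sequence encoding with SEP prepended.

    `encoded` is any mapping with an "input_ids" list, e.g. the result of
    tokenizer(your_input, add_special_tokens=False) for one string.
    """
    out = dict(encoded)
    out["input_ids"] = [sep_token_id] + list(encoded["input_ids"])
    # Keep the optional per-token lists the same length as input_ids.
    if "attention_mask" in encoded:
        out["attention_mask"] = [1] + list(encoded["attention_mask"])  # attend to SEP
    if "token_type_ids" in encoded:
        out["token_type_ids"] = [0] + list(encoded["token_type_ids"])  # first segment
    return out


# Demo with a plain dict (illustrative ids, no model download needed).
demo = {"input_ids": [11, 12], "attention_mask": [1, 1], "token_type_ids": [0, 0]}
result = prepend_sep(demo, sep_token_id=102)
print(result["input_ids"])

# With a real tokenizer this would look like (untested):
# encoded = tokenizer(your_input, add_special_tokens=False)
# encoded = prepend_sep(encoded, tokenizer.sep_token_id)
```

If you only feed input_ids to your own LSTM decoder, the extra keys do not matter and the two-line version is enough.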
Thank you. That solved the issue.