Is it OK to get word embeddings without adding special tokens?


I’m new to Hugging Face and BERT, and I’m looking for ways to get word embeddings from BERT.

I’ve read several posts online and found that people always add the special tokens [CLS] and [SEP] to the sentence before feeding the tensor to the BERT model and reading the word embeddings from last_hidden_state.

I understand the idea, since this follows the way BERT was trained, but I still wonder whether it is possible to get word embeddings without adding the special tokens.
And if it is possible, is there any difference between the two methods (adding them or not)?

Hi @Alethia, you can manipulate the ‘input_ids’ generated by the BERT tokenizer (e.g. remove the [CLS] and [SEP] tokens). By feeding the modified ‘input_ids’ to the model, you can observe how BERT handles inputs without special tokens.
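As a rough sketch of that manipulation, the helper below drops special-token ids from a list of ‘input_ids’ and also records the original positions of the tokens it keeps. The function name `strip_special_tokens` is made up for illustration, and the ids 0, 101 and 102 are assumed to be [PAD], [CLS] and [SEP] as in the bert-base-uncased vocabulary; with a real tokenizer it is safer to read them from `tokenizer.all_special_ids`.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumption: ids of [PAD], [CLS], [SEP] in the bert-base-uncased vocabulary.
# Other checkpoints may differ; prefer tokenizer.all_special_ids when available.
SPECIAL_IDS: FrozenSet[int] = frozenset({0, 101, 102})


def strip_special_tokens(
    input_ids: Sequence[int],
    special_ids: FrozenSet[int] = SPECIAL_IDS,
) -> Tuple[List[int], List[int]]:
    """Return (ids without special tokens, their positions in the original input)."""
    kept = [(pos, tok) for pos, tok in enumerate(input_ids) if tok not in special_ids]
    positions = [pos for pos, _ in kept]
    ids = [tok for _, tok in kept]
    return ids, positions


# Illustrative ids for "[CLS] hello world [SEP] [PAD]".
example_ids = [101, 7592, 2088, 102, 0]
stripped_ids, kept_positions = strip_special_tokens(example_ids)
print(stripped_ids)     # ids that remain after removing special tokens
print(kept_positions)   # where those ids sat in the original sequence
```

If you only want the tokenizer to skip the special tokens in the first place, calling it with `add_special_tokens=False` should give the same ids without any post-processing.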

AFAIK, BERT can handle it without any fatal error. However, removing the special tokens brings no benefits. The output of the [CLS] token is commonly treated as a ‘sentence embedding’, a representation of the input as a whole. If [CLS] is removed, how can we get this information? The output of the last token might be a solution (similar to how RNNs and LSTMs are used), but that approach has weaknesses such as forgetting earlier context and bias.
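One way to keep both kinds of information is to leave the special tokens in the input and separate them only in the output. The sketch below is a toy illustration: plain nested lists stand in for one sentence's `last_hidden_state` (one row per token), `split_hidden_states` is a hypothetical helper name, and the special-token ids are again assumed to be those of bert-base-uncased.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def split_hidden_states(
    hidden: Sequence[Vector],
    input_ids: Sequence[int],
    cls_id: int = 101,                            # assumed [CLS] id
    special_ids: Sequence[int] = (0, 101, 102),   # assumed [PAD], [CLS], [SEP] ids
) -> Tuple[Vector, List[Vector]]:
    """Return ([CLS] vector, vectors of the non-special tokens)."""
    if len(hidden) != len(input_ids):
        raise ValueError("hidden and input_ids must have the same length")
    if not input_ids or input_ids[0] != cls_id:
        raise ValueError("expected [CLS] at position 0")
    sentence_vec = hidden[0]  # row of the [CLS] token
    word_vecs = [vec for tok, vec in zip(input_ids, hidden) if tok not in special_ids]
    return sentence_vec, word_vecs


# Toy data: 4 tokens ([CLS], two words, [SEP]) with a hidden size of 2.
toy_ids = [101, 7592, 2088, 102]
toy_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
sentence_vec, word_vecs = split_hidden_states(toy_hidden, toy_ids)
print(sentence_vec)  # vector at the [CLS] position
print(word_vecs)     # one vector per ordinary token
```

With real model output you would index the tensor the same way, after selecting a single sentence from the batch dimension.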

To sum up, you can remove the special tokens, but since BERT was trained with them, doing so tends to hurt the model's performance.

Hi Weiheng,

I got the idea, thank you for your help!