Is it OK to get word embeddings without adding special tokens?


I’m a beginner with Hugging Face and BERT, and I’m looking for ways to get word embeddings from BERT.

I’ve read several posts on the internet, and I found that people always add the special tokens [CLS] and [SEP] to the sentence before they feed the tensor to the BERT model and take word embeddings from last_hidden_state.

I understand the idea, since this follows how BERT was trained, but I still wonder whether it is possible to get word embeddings without adding the special tokens.
And if it is possible, is there any difference between the two methods (adding or not adding)?

Hi @Alethia, you can manipulate the ‘input_ids’ generated by the BERT tokenizer (e.g., remove the [CLS] and [SEP] tokens). Using the modified ‘input_ids’, you can observe how BERT handles inputs without special tokens.
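A minimal sketch of that manipulation, using plain Python lists so it stands alone. The ids 101 and 102 are assumptions here (they are the [CLS]/[SEP] ids in the common `bert-base-uncased` vocabulary; for your own checkpoint, check `tokenizer.cls_token_id` and `tokenizer.sep_token_id`, or simply tokenize with `add_special_tokens=False` in the first place):

```python
# Assumed special-token ids for bert-base-uncased; verify against
# your tokenizer (tokenizer.cls_token_id / tokenizer.sep_token_id).
CLS_ID = 101
SEP_ID = 102

def strip_special_tokens(input_ids):
    """Return input_ids with the [CLS] and [SEP] ids removed."""
    return [tok for tok in input_ids if tok not in (CLS_ID, SEP_ID)]

# Toy example: ids for "[CLS] hello world [SEP]" under bert-base-uncased.
ids = [101, 7592, 2088, 102]
print(strip_special_tokens(ids))  # [7592, 2088]
```

The stripped list can then be fed to the model in place of the original ‘input_ids’ to see how BERT behaves without the special tokens.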

AFAIK, BERT can handle it without any fatal error. However, removing the special tokens brings no benefit. The output at the [CLS] position is treated as the ‘sentence embedding’, a representation of the input as a whole. If [CLS] is removed, how can we get this information? The output at the last token might be a substitute (as RNNs and LSTMs do), but that approach has weaknesses, such as forgetting earlier tokens and a bias toward the end of the sequence.

To sum up, you can remove the special tokens, but it hurts model performance.
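One more option worth noting: you can keep [CLS] and [SEP] in the input (so the model sees what it was trained on) and simply slice those positions off the output when you only want per-word vectors. A hedged sketch with a toy NumPy array standing in for `last_hidden_state` (a real one is a tensor of shape `(batch, seq_len, hidden)`; the numbers here are illustrative, not real BERT outputs):

```python
import numpy as np

# Toy stand-in for last_hidden_state: one sentence of 4 tokens
# ([CLS], hello, world, [SEP]) with a hidden size of 3.
last_hidden_state = np.arange(12, dtype=float).reshape(1, 4, 3)

# Drop position 0 ([CLS]) and the final position ([SEP]),
# keeping only the word-token vectors in between.
word_vectors = last_hidden_state[:, 1:-1, :]
print(word_vectors.shape)  # (1, 2, 3)
```

With padded batches, the [SEP] position varies per sentence, so in practice you would index each row by its actual length (from the attention mask) rather than slicing with `-1`.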


Hi Weiheng,

I got the idea, thank you for your help!

Hi. Your answer helps me too, but I still want to ask a question: can I just use the word embeddings from BERT without special tokens if I don’t need the output at the [CLS] token? That is, I don’t need the ‘sentence embedding’.