BERT and GPT2 embedding questions

dreidizzle · December 25, 2022, 2:19pm

Hi, I have two questions related to the embeddings I am getting from a BERT model and a GPT2 model. I am inputting a sentence of 4 words. For GPT2 I get 4 tokens, for BERT I get 6 since I add SEP and CLS.

1. If I want to “summarize” the sentence into one vector with BERT, should I use the CLS embedding or the mean of the token embeddings within the sentence? The CLS token was added for the NSP task, so in some sense it captures whether one sentence is related to the next for that pretraining task. However, what if there is no second sentence, like in this case?
2. What exactly are the embeddings I get from GPT2? My understanding is that you feed data to GPT2 in an autoregressive fashion and decode with it, but I get 4 vectors for the 4 words/tokens. Is this literally just putting 4 words through the decoder without masking? To use GPT2, I’d put in one word at a time and see what it generates. But I’m unsure what these 4 embeddings mean in this case and how you use them. I guess you’d use them the way ELMo (another language model) uses embeddings? I.e., for that, you basically use a weighted sum of all embeddings for each time step …

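from transformers import BertModel, BertTokenizer, GPT2Model, GPT2Tokenizer

text = "this is a sentence"
# Load the pretrained tokenizers and models
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained("gpt2")
model_bert = BertModel.from_pretrained("bert-base-uncased")
model_gpt2 = GPT2Model.from_pretrained("gpt2")
# Tokenize into PyTorch tensors (BERT adds [CLS] and [SEP]; GPT2 adds no special tokens)
encoded_input_bert = tokenizer_bert(text, return_tensors='pt')
encoded_input_gpt2 = tokenizer_gpt2(text, return_tensors='pt')
# Forward pass; last_hidden_state on each output holds one vector per token
output_gpt2 = model_gpt2(**encoded_input_gpt2)
output_bert = model_bert(**encoded_input_bert)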
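
On the GPT2 question, one way to see what the 4 vectors represent is a toy experiment. GPT2 applies a causal mask inside its attention layers, so the vector at position i can only depend on tokens 0..i. The sketch below is not GPT2 itself: it is a single-head attention layer with random weights (`causal_self_attention`, `w_q`, `w_k`, `w_v` are made-up names for illustration), meant only to show the masking behavior under those assumptions.

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Toy single-head self-attention with a causal (lower-triangular) mask.

    x has shape (seq_len, d); position i may only attend to positions <= i.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    seq_len = x.shape[0]
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d = 8
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

x = torch.randn(4, d)            # stand-in for 4 token embeddings
x_changed = x.clone()
x_changed[3] = torch.randn(d)    # replace only the last token

out = causal_self_attention(x, w_q, w_k, w_v)
out_changed = causal_self_attention(x_changed, w_q, w_k, w_v)

# The first three positions never see token 3, so their outputs do not move.
prefix_same = torch.allclose(out[:3], out_changed[:3])
# The last position does see token 3, so its output changes.
last_same = torch.allclose(out[3], out_changed[3])
print(prefix_same, last_same)
```

If the same holds for the real model, the last of the 4 vectors is the only one that has seen the whole sentence, which is why it is a common choice when a single vector is needed from a left-to-right model.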

savasy · December 28, 2022, 7:59pm

You can treat the CLS token's vector as the sentence embedding. Don't worry about the NSP objective, just use it. By the way, to get the best sentence representation, you should use SBERT (Sentence-BERT).

dreidizzle · December 28, 2022, 8:37pm

Thank you! But do you know why this is the case, or have a reference? Not about Sentence-BERT, but about the CLS token vector vs. averaging the vectors of all the other tokens … It seems like CLS is a kind of difference between two sentence vectors, while mean pooling is a direct summary. I wonder what the correlation between them is, etc.
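
For anyone who wants to compare the two options empirically, here is a minimal sketch of both pooling strategies. It assumes a hidden-state tensor of shape (batch, seq_len, hidden) like BERT's `last_hidden_state`, plus the tokenizer's attention mask. The helper names `cls_vector` and `mean_pool` are made up for illustration, and the random tensor only stands in for real model output. This version averages over every non-padding position, including CLS and SEP; excluding them is a possible variant.

```python
import torch

def cls_vector(last_hidden_state):
    """Return the vector at position 0, which is [CLS] for BERT inputs."""
    return last_hidden_state[:, 0, :]

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Synthetic stand-in for a BERT output: 1 sentence, 6 tokens, 768 dimensions.
torch.manual_seed(0)
hidden = torch.randn(1, 6, 768)
mask = torch.ones(1, 6, dtype=torch.long)

cls_vec = cls_vector(hidden)
mean_vec = mean_pool(hidden, mask)

# Cosine similarity is one simple way to see how far apart the two summaries are.
similarity = torch.nn.functional.cosine_similarity(cls_vec, mean_vec, dim=-1)
print(cls_vec.shape, mean_vec.shape, similarity.shape)
```

With the snippet from the first post, the same functions would take `output_bert.last_hidden_state` and `encoded_input_bert["attention_mask"]` as inputs. Which summary works better likely depends on the downstream task, so checking both on your own data is the safest route.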
