BERT and GPT2 embedding questions

Hi, I have two questions about the embeddings I get from a BERT model and a GPT2 model. I am inputting a sentence of 4 words. For GPT2 I get 4 tokens; for BERT I get 6, since the tokenizer adds [CLS] and [SEP].

  1. If I want to “summarize” the sentence into one vector with BERT, should I use the [CLS] embedding or the mean of the token embeddings within the sentence? The [CLS] token was added for the NSP task, so in some sense it captures whether one sentence is related to the next for that pre-training task. But what if there is no second sentence, like in this case? (There is a pooling sketch after the code below.)

  2. What exactly are the embeddings I get from GPT2? My understanding is that you feed data to GPT2 in an autoregressive fashion and decode with it, but I get 4 vectors for the 4 words/tokens. Is this literally just putting the 4 words through the decoder with no masking? To generate with GPT2, I’d feed one word at a time and see what it produces. But I’m unsure what these 4 embeddings mean in this case and how you would use them. I guess you’d use them the way ELMo (another language model) uses embeddings, i.e., a weighted sum of the embeddings at each time step … (A sketch of this follows the code below.)

text = "this is a sentence"
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained("gpt2")
model_bert = BertModel.from_pretrained("bert-base-uncased")
model_gpt2 = GPT2Model.from_pretrained("gpt2")
encoded_input_bert = tokenizer_bert(text, return_tensors='pt')
encoded_input_gpt2 = tokenizer_gpt2(text, return_tensors='pt')
output_gpt2 = model_gpt2(**encoded_input_)
output_bert = model_bert(**encoded_input_bert)
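
For question 1, here is a minimal sketch of the two options, building on the variables above. The shapes assume this exact 6-token input; with padded batches you would weight the mean by the attention mask instead of slicing.

last_hidden = output_bert.last_hidden_state          # (1, 6, 768): [CLS] this is a sentence [SEP]
cls_embedding = last_hidden[:, 0, :]                 # the [CLS] position as a sentence vector

# Mean over the word tokens only, dropping [CLS] and [SEP]; whether this or the
# [CLS] vector works better tends to depend on the downstream task.
mean_embedding = last_hidden[:, 1:-1, :].mean(dim=1)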
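
For question 2, GPT2Model returns one hidden state per position, and because of the causal attention mask each position has only attended to itself and the tokens to its left, so the last position is conditioned on the whole sentence. A sketch of two simple ways to pool them (a mean or the last token, rather than the learned layer-weighted sum ELMo uses):

hidden_gpt2 = output_gpt2.last_hidden_state          # (1, 4, 768): one vector per token
last_token_embedding = hidden_gpt2[:, -1, :]         # conditioned on the full prefix
mean_gpt2_embedding = hidden_gpt2.mean(dim=1)        # simple average over positions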

We can treat the [CLS] token embedding as a sentence embedding. Do not worry about the NSP objective, just use it. Btw, to get the best sentence representations, you should use SBERT.
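
If you do go the SBERT route, the sentence-transformers library wraps this; a minimal sketch (the checkpoint name here is just one commonly used example, not the only choice):

from sentence_transformers import SentenceTransformer

# SBERT-style sentence embeddings: one fixed-size vector per sentence
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embedding = sbert_model.encode("this is a sentence")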


Thank you! But do you know why this is the case, or have a reference? Not about Sentence-BERT, but about the [CLS] token vector vs. averaging the vectors of the other tokens … It seems like [CLS] is some kind of difference between two sentence vectors (because of NSP), while mean pooling is a direct summary. I wonder how correlated the two are, etc.
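
One quick way to get a feel for how the two summaries relate is to compare them directly for a given sentence; a sketch using the cls_embedding and mean_embedding from the earlier snippet (the number you get is only illustrative, not a general answer):

import torch.nn.functional as F

# How similar is the [CLS] vector to the mean-pooled vector for this one sentence?
print(F.cosine_similarity(cls_embedding, mean_embedding, dim=-1).item())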