Hi, I have two questions about the embeddings I get from a BERT model and a GPT-2 model. I am inputting a sentence of 4 words. For GPT-2 I get 4 tokens; for BERT I get 6, since the tokenizer adds [CLS] and [SEP].
If I want to "summarize" the sentence into one vector with BERT, should I use the [CLS] embedding or the mean of the token embeddings within the sentence? The [CLS] token was added for the NSP task, so in some sense it captures whether one sentence is related to the next for that pre-training objective. But what if there is no second sentence, like in this case?
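For context, here is a sketch of the two options I mean, using random tensors as stand-ins for the model output (the shapes and hidden size are hypothetical; in practice `hidden` would be `output_bert.last_hidden_state`):

```python
import torch

# Stand-in for BertModel output: batch of 1, 6 tokens
# ([CLS] + 4 word tokens + [SEP]), hidden size 8 (hypothetical).
hidden = torch.randn(1, 6, 8)
attention_mask = torch.ones(1, 6)  # no padding in this toy example

# Option 1: mean of the tokens *within* the sentence, i.e. zero out
# [CLS] (position 0) and [SEP] (last position) before averaging.
special_mask = attention_mask.clone()
special_mask[:, 0] = 0
special_mask[:, -1] = 0

summed = (hidden * special_mask.unsqueeze(-1)).sum(dim=1)
counts = special_mask.sum(dim=1, keepdim=True)
sentence_vec = summed / counts   # shape: (1, 8)

# Option 2: just take the [CLS] embedding.
cls_vec = hidden[:, 0, :]        # shape: (1, 8)
```

So the question is which of `sentence_vec` or `cls_vec` makes a better single-sentence summary when there is no NSP-style sentence pair.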
And the embeddings I get from GPT-2, what exactly are they? My understanding is that you feed data to GPT-2 in an autoregressive fashion and decode with it, but I get 4 vectors for the 4 words/tokens. Is this literally just putting the 4 words through the decoder without masking? To generate with GPT-2, I'd feed it one word at a time and see what it produces. But I'm unsure what these 4 embeddings mean in this case and how you use them. I guess you'd use them the way ELMo (another language model) uses its embeddings? I.e., there you basically take a weighted sum of the embeddings from all layers for each time step …
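What I mean by the ELMo-style combination, as a sketch with random tensors (the layer count matches "gpt2", which returns 13 hidden states, embeddings plus 12 blocks, when called with `output_hidden_states=True`; the hidden size here is a made-up toy value):

```python
import torch

# Hypothetical stand-in for the tuple of per-layer hidden states you would
# get from model_gpt2(..., output_hidden_states=True): 13 layers, batch 1,
# 4 tokens, toy hidden size 8.
num_layers, seq_len, hidden_size = 13, 4, 8
layer_states = torch.randn(num_layers, 1, seq_len, hidden_size)

# ELMo-style mixing: one softmax-normalized scalar weight per layer plus a
# global scale, both of which would be learned on the downstream task.
layer_logits = torch.zeros(num_layers, requires_grad=True)
gamma = torch.ones(1, requires_grad=True)

weights = torch.softmax(layer_logits, dim=0)  # (num_layers,), sums to 1
combined = gamma * (weights.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
# combined: (1, seq_len, hidden_size), one contextual vector per token
```

With zero logits this is just the mean over layers; training moves the weights toward whichever layers help the task.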
```python
from transformers import BertModel, BertTokenizer, GPT2Model, GPT2Tokenizer

text = "this is a sentence"

tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained("gpt2")
model_bert = BertModel.from_pretrained("bert-base-uncased")
model_gpt2 = GPT2Model.from_pretrained("gpt2")

encoded_input_bert = tokenizer_bert(text, return_tensors='pt')
encoded_input_gpt2 = tokenizer_gpt2(text, return_tensors='pt')

output_bert = model_bert(**encoded_input_bert)
output_gpt2 = model_gpt2(**encoded_input_gpt2)
```