I’m working on a program that computes word and sentence embeddings with GPT-2, specifically the GPT2Model
class. For the word embeddings, I take the last hidden state, outputs[0]
, after forwarding the input_ids
(shape: batch_size x seq_len)
through the GPT2Model
class. For the sentence embedding, I take the hidden state of the token at the end of each sequence. This is the code I have tried:
from transformers import GPT2Tokenizer, GPT2Model
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
captions = ["example caption", "example bird", "the bird is yellow has red wings", "hi", "very good"]
encoded_captions = [tokenizer.encode(caption) for caption in captions]
# Pad sequences to the same length with 0s
max_len = max(len(seq) for seq in encoded_captions)
padded_captions = [seq + [0] * (max_len - len(seq)) for seq in encoded_captions]
# Convert to a PyTorch tensor with batch size 5
input_ids = torch.tensor(padded_captions)
outputs = model(input_ids)
# Last hidden state: (batch_size, seq_len, hidden_size)
word_embedding = outputs[0].contiguous()
# Hidden state at the last position of each sequence: (batch_size, hidden_size)
sentence_embedding = word_embedding[:, -1, :].contiguous()
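To illustrate the shapes I expect, here is a minimal sketch of the same indexing using dummy tensors instead of the model output (the batch size of 5 matches my captions; the hidden size of 768 is the 'gpt2' default; max_len = 7 is just an example):

```python
import torch

# Dummy stand-in for outputs[0]: (batch_size, max_len, hidden_size)
batch_size, max_len, hidden_size = 5, 7, 768
hidden_states = torch.randn(batch_size, max_len, hidden_size)

# Same two lines as above, applied to the dummy tensor
word_embedding = hidden_states.contiguous()                 # (5, 7, 768)
sentence_embedding = word_embedding[:, -1, :].contiguous()  # (5, 768)

print(word_embedding.shape)      # torch.Size([5, 7, 768])
print(sentence_embedding.shape)  # torch.Size([5, 768])
```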
I’m not sure whether my calculations of the word and sentence embeddings are correct. Can anyone help me confirm this? Thanks for your help.