I’m working on a program that computes word and sentence embeddings with GPT-2, specifically the GPT2Model
class. For the word embeddings, I take the last hidden state, outputs[0]
, after forwarding the input_ids
(shape: batch_size x seq_len)
through the GPT2Model
class. For the sentence embedding, I take the hidden state of the token at the end of each sequence. This is the code I have tried:
from transformers import GPT2Tokenizer, GPT2Model
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
captions = ["example caption", "example bird", "the bird is yellow has red wings", "hi", "very good"]
encoded_captions = [tokenizer.encode(caption) for caption in captions]
# Pad sequences to the same length with 0s
max_len = max(len(seq) for seq in encoded_captions)
padded_captions = [seq + [0] * (max_len - len(seq)) for seq in encoded_captions]
# Convert to a PyTorch tensor with batch size 5
input_ids = torch.tensor(padded_captions)
outputs = model(input_ids)
# Last hidden state: (batch_size, seq_len, hidden_size)
word_embedding = outputs[0].contiguous()
# Hidden state at the last position of each sequence: (batch_size, hidden_size)
sentence_embedding = word_embedding[:, -1, :].contiguous()
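To illustrate the shapes I expect, here is a minimal sketch of the same indexing using dummy tensors instead of the model output (the batch size of 5 matches my captions; the hidden size of 768 is the 'gpt2' default; max_len = 7 is just an example):

```python
import torch

# Dummy stand-in for outputs[0]: (batch_size, max_len, hidden_size)
batch_size, max_len, hidden_size = 5, 7, 768
hidden_states = torch.randn(batch_size, max_len, hidden_size)

# Same two lines as above, applied to the dummy tensor
word_embedding = hidden_states.contiguous()                 # (5, 7, 768)
sentence_embedding = word_embedding[:, -1, :].contiguous()  # (5, 768)

print(word_embedding.shape)      # torch.Size([5, 7, 768])
print(sentence_embedding.shape)  # torch.Size([5, 768])
```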
I’m not sure whether my calculations of the word and sentence embeddings are correct. Can anyone help me confirm this? Thanks for your help.