Suppose I have the following text
aim = 'Hello world! you are a wonderful place to be in.'
I want to use GPT-2 to produce the input_ids, then produce the embeddings, and finally recover the input_ids from those embeddings. To do this I do:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
The input_ids can be defined as:
input_ids = tokenizer(aim)['input_ids']
#output: [15496, 995, 0, 345, 389, 257, 7932, 1295, 284, 307, 287, 13]
I can decode this to make sure it reproduces the aim:
tokenizer.decode(input_ids)
#output: 'Hello world! you are a wonderful place to be in.'
as expected! To produce the embeddings I convert the input_ids to a tensor (this needs torch imported):
import torch

input_ids_tensor = torch.tensor([input_ids])
I can then produce my embeddings as:
# Generate the embeddings for input IDs
with torch.no_grad():
    model_output = model(input_ids_tensor)
last_hidden_states = model_output.last_hidden_state
# Extract the embeddings for the input IDs from the last hidden layer
input_embeddings = last_hidden_states[0,1:-1,:]
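For reference, the shapes involved (12 ids from the tokenizer, hidden size 768 for gpt2) can be checked without loading the model; note that the 1:-1 slice drops the first and last positions, which may not be what I want since the GPT-2 tokenizer adds no special tokens:

```python
import torch

# Stand-in tensor with the same shape last_hidden_states would have:
# (batch=1, seq_len=12, hidden_size=768)
last_hidden_states = torch.zeros(1, 12, 768)
input_embeddings = last_hidden_states[0, 1:-1, :]
print(input_embeddings.shape)  # torch.Size([10, 768]) - two positions lost
```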
Now as mentioned earlier, the aim is to use input_embeddings and recover the input_ids, so I do:
x = torch.unsqueeze(input_embeddings, 1) # to make the dim acceptable
with torch.no_grad():
    text = model(x.long())
decoded_text = tokenizer.decode(text[0].argmax(dim=-1).tolist())
But doing this I get:
IndexError: index out of range in self
at the line text = model(x.long()).
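My guess is that the model treats x.long() as a batch of token ids, and the truncated float values fall outside the vocabulary range. I can reproduce the identical error with a plain embedding layer and an out-of-range index:

```python
import torch

# A tiny embedding layer with a vocabulary of 10 ids.
emb = torch.nn.Embedding(10, 4)
bad_ids = torch.tensor([[3, 12]])  # 12 >= 10, so it is out of range

try:
    emb(bad_ids)
except IndexError as e:
    print(e)  # "index out of range in self"
```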
I wonder what I am doing wrong. How can I recover the input_ids from the embeddings I produced?
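To make the goal concrete: if I had the raw token embeddings (i.e. model.wte(input_ids) rather than the transformer's last hidden states), I believe a nearest-neighbour lookup against the embedding matrix should invert them exactly. A minimal sketch of that idea, using a small random matrix as a stand-in for model.wte.weight:

```python
import torch

torch.manual_seed(0)

# Small random stand-in for GPT-2's token-embedding matrix
# model.wte.weight, which has shape (vocab_size, hidden_size).
vocab_size, hidden_size = 50, 8
embedding_matrix = torch.randn(vocab_size, hidden_size)

ids = torch.tensor([3, 17, 42])     # the ids we want to recover
embeddings = embedding_matrix[ids]  # their embedding rows, (3, hidden_size)

# Nearest-neighbour search: each row's closest matrix row is itself,
# so argmin over distances recovers the original ids.
dists = torch.cdist(embeddings, embedding_matrix)  # (3, vocab_size)
recovered = dists.argmin(dim=-1)
print(recovered.tolist())  # [3, 17, 42]
```

Would this kind of lookup also work on last_hidden_states, or does passing the ids through the transformer make exact recovery impossible?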