How to decode GPT2

What’s the proper way to decode the output of GPT2?

from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
tokenizer.decode(output)

The line tokenizer.decode(output) gives me this error

I don’t think the code you’ve written will give you anything a tokenizer can decode. By calling TFGPT2Model without a task head (e.g., TFGPT2LMHeadModel for causal language modeling), you’re just getting back a TFBaseModelOutputWithPastAndCrossAttentions object containing GPT-2’s 768-dimensional hidden states for each input token in output.last_hidden_state.

In your case, output.last_hidden_state is a tensor with shape (1, 10, 768) because you have one input with 10 tokens, and GPT-2 uses 768 embedding dimensions.

The HuggingFace pattern is to add a “modelling head” on top of the base model to perform whatever NLP task you’re after. If you’re looking for tokens you can decode, that’s probably causal language modelling.

A simple TensorFlow example for causal language modelling might look like:

from transformers import GPT2Tokenizer, TFGPT2LMHeadModel


def main():
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    model = TFGPT2LMHeadModel.from_pretrained('gpt2')
    output = model.generate(**encoded_input)
    decoded = tokenizer.decode(output[0])
    print(decoded)


if __name__ == '__main__':
    main()

In this example, model.generate() is doing a lot of heavy lifting for you compared with calling model(encoded_input) directly, and most of that behaviour is controllable through its enormous number of optional parameters.
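As a rough illustration of the loop generate() runs for you, here’s a toy sketch of greedy decoding with a stand-in next_token_logits function (hypothetical, not a real transformers API). The real generate() adds batching, KV caching, sampling strategies, stopping criteria, and much more:

```python
import numpy as np

def next_token_logits(token_ids):
    # Stand-in for a real model call over a tiny 5-token vocabulary.
    # With a real model this would be model(input_ids).logits[:, -1, :].
    rng = np.random.default_rng(sum(token_ids))
    return rng.standard_normal(5)

def greedy_generate(prompt_ids, max_new_tokens=5, eos_token_id=0):
    # The core of what model.generate() does with default (greedy) settings:
    # repeatedly pick the argmax token and append it to the sequence.
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids)
        next_id = int(np.argmax(logits))
        token_ids.append(next_id)
        if next_id == eos_token_id:  # stop early on end-of-sequence
            break
    return token_ids

print(greedy_generate([1, 2, 3]))
```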


For the sake of completeness, here’s a minimal example that does call the model directly (i.e. without generate()):

import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel


def main():
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    text = "When it comes to making transformers easy, HuggingFace is the"
    encoded_input = tokenizer(text, return_tensors='tf')
    model = TFGPT2LMHeadModel.from_pretrained('gpt2')
    output = model(encoded_input)
    logits = output.logits[0, -1, :]
    softmax = tf.math.softmax(logits, axis=-1)
    argmax = tf.math.argmax(softmax, axis=-1)
    print(text, "[", tokenizer.decode(argmax), "]")


if __name__ == '__main__':
    main()

It generates exactly one token, which in this case should be “best”, since it deterministically picks the highest-probability token in the output:

2022-03-28 15:45:30.783765: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
When it comes to making transformers easy, HuggingFace is the [  best ]

It seems the softmax is redundant when only the best token is required, since softmax is monotonic and doesn’t change the ordering of the logits.
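Right — softmax never changes which index is largest, so tf.math.argmax(logits) alone would suffice there. A quick check with NumPy standing in for the TensorFlow calls above:

```python
import numpy as np

logits = np.array([1.5, -0.3, 4.2, 0.7])

# Numerically stable softmax: shift by the max before exponentiating.
exp = np.exp(logits - logits.max())
softmax = exp / exp.sum()

# Softmax is monotonic, so the argmax is unchanged.
print(np.argmax(logits), np.argmax(softmax))  # prints: 2 2
```

The softmax is only needed if you want actual probabilities, e.g. for sampling or for reporting the model’s confidence.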