What does it mean to prime a GPT model?

I am not sure I understand what it means to prime an LM. I came across this concept in several blog posts and papers (where it is sometimes also described as exploring the meta-learning capabilities of the model, or as in-context learning).

From the OpenAI GPT-2 paper, section 3.7 (Translation):

We test whether GPT-2 has begun to learn how to translate
from one language to another. In order to help it infer that
this is the desired task, we condition the language model
on a context of example pairs of the format english
sentence = french sentence and then after a final prompt of english sentence = 

This, I believe, is an example of priming? Since transformers have no concept of a hidden state being passed from one step to the next, we provide the model with an input sequence of up to 1024 tokens, and the model outputs up to 1024 × vocab-size softmax activations, where each encodes a probability distribution over the next token (the one following the token at that position). So priming would just be constructing the input sequence in a specific manner?
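For concreteness, here is a minimal sketch of how such a priming sequence could be assembled in the `english sentence = french sentence` format the paper describes. The example pairs and the final prompt are made up for illustration; in practice the whole prompt plus the generated continuation would have to fit inside the model's 1024-token context window.

```python
# Build a few-shot "priming" prompt in the "english sentence = french sentence"
# format described in the GPT-2 paper. The pairs below are illustrative, not
# taken from the paper.
example_pairs = [
    ("the cat is black", "le chat est noir"),
    ("I like coffee", "j'aime le café"),
    ("where is the station", "où est la gare"),
]

prompt = "\n".join(f"{en} = {fr}" for en, fr in example_pairs)
# Final unfinished pair: the model is expected to continue with the French
# translation of this last English sentence.
prompt += "\ngood morning ="

print(prompt)
```

This string would then be tokenized and fed to the model as its input sequence; the priming is entirely in how the text is laid out.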

If I am reading this correctly, priming refers to the act of passing a sequence into the model, expecting that the model's meta-learning capability will affect its output?

In this sense, for priming, we are always limited to a sequence of < 1024 tokens (where those 1024 tokens need to suffice for both the priming sequence and the output)?

Passing the past parameter just saves compute: it provides the model with the key/value pairs calculated at earlier steps of text generation, but nothing else magical is happening there?

And last but not least: are such questions okay to ask? This would certainly qualify as a beginner question, and it doesn't directly relate to the library, I suppose. I really appreciate the amazing resources you have put out there, the transformers library along with the wonderful documentation; in fact, I am blown away by how awesome it is. I just want to make sure I am not bothering you with my questions and that I am using the forums in the way they were intended.

Thank you very much! :pray:


If I am reading this correctly, priming refers to the act of passing a sequence into the model, expecting that the model's meta-learning capability will affect its output?

You’ve hit the nail on the head. When talking about a left-to-right model like GPT-N, priming is just prepending text that is similar in some way to the text you are predicting, which often helps the model predict it correctly.

Incidentally, this is the thing that GPT-3 seems to be especially good at. There seems to be something about language models that we don’t completely understand that makes priming a surprisingly effective meta-learning technique, especially when the models get really big. See this Twitter thread for some examples.

And yes, this kind of question is perfect for the forums. However, I’d say Research is probably a better category fit, since this is more about general NLP/research discussion than about the HF libraries :slight_smile:


Thank you very much for your answer Joe, really appreciate it! :slight_smile: And thank you for linking to the Twitter thread, super interesting. I will keep the Research category in mind going forward!

Just as an informative comment: priming is actually a term from psychology, and perhaps more specifically psycholinguistics. I am doing some research into this. An example of priming: if you show participants a number of sentences, most of which use a passive construction (“The apple was eaten by the man.”), and then show them a picture and ask them to describe it, and they describe what they see with a passive, then they were (unconsciously) primed by the earlier texts.


They used the term ‘condition’, but of course it’s not truly conditioned in the sense of methods like CTRL and PPLM. So referring to it as ‘priming’ might be a better choice.

Personally I use them interchangeably in this context. I have a slight preference for “priming” because IMO it’s more evocative in communicating what you’re trying to accomplish with this particular kind of conditioning, but I think either works (conditioning is probably more common?).