Batch generation with GPT2

How to do batch generation with the GPT2 model?

1 Like

Batch generation is now possible for GPT2 in master by leveraging the functionality shown in this PR: https://github.com/huggingface/transformers/pull/7552 .

For more info on how to prepare GPT2 for batch generation, you can check out this test:

4 Likes

Hi, I am the author of the PR.

You can now do batch generation by calling the same generate() method.
All you need to add is:

  1. set tokenizer.padding_side = "left" (and remember to reset it afterwards)
  2. pass attention_mask to generate()

Explanation: (see full example in the end)

  1. We need tokenizer.padding_side = "left" because we will use the logits of the right-most token to predict the next token, so the padding should be on the left.
  2. This is what the PR added. Here is a summary:

GPT-2 uses absolute positional embeddings (position_ids). Before this change, no position_ids were passed to the model, and the model automatically generated them from 0 to n, even if there was padding (e.g. when the input is a batch).

Example: tokens = <pad> <pad> a b c -> position_ids = 0 1 2 3 4, whereas what we expect is x x 0 1 2 (x means don’t care)

This PR adds position_ids in prepare_inputs_for_generation(), which is called in generate(), by computing them from the attention_mask, and that’s why you need to pass it in.

You can find a full example in the PR.
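
For convenience, here is a minimal sketch along those lines (the checkpoint name, prompts, and generation length are just placeholders):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT2 has no pad token, so a common choice is to reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# 1. Pad on the left so the right-most token of every row is a real token.
tokenizer.padding_side = "left"

sentences = ["Hello, my dog is", "Today the weather"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

# 2. Pass the attention_mask so position_ids can be computed correctly.
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=20,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```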

7 Likes

Hi there. Thanks for your work to support batch inference in GPT2. However, I still have one point of confusion, which may need your help. Thanks in advance!
If I want to pass past_key_values, how should I process the position_ids and attention mask? Suppose the length of my past_key_values is 2 and the padded input is just like your example: <pad>, <pad>, a, b, c. Should I change the attention mask from 0, 0, 1, 1, 1 to 1, 1, 0, 0, 1, 1, 1, where the first double "1" refers to the past_key_values?
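
To make the shapes concrete, here is roughly what I have in mind (placeholder token ids, purely to illustrate the question):

```python
import torch

# Hypothetical shapes only; pad_id / a_id / b_id / c_id are placeholder ids.
pad_id, a_id, b_id, c_id = 50256, 64, 65, 66
input_ids = torch.tensor([[pad_id, pad_id, a_id, b_id, c_id]])  # current tokens, shape (1, 5)

# past_key_values is assumed to cover 2 earlier real tokens, so the mask
# would cover past + current positions, shape (1, 2 + 5) = (1, 7):
# ones for the 2 past tokens, zeros for the pads, ones for a, b, c.
attention_mask = torch.tensor([[1, 1, 0, 0, 1, 1, 1]])
```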
Thanks a lot!

@patrickvonplaten @ttj I think this is a good question! Could we discuss how to do batch inference with past_key_values?

Is it possible to have a variable max_gen_length, depending on the length of the input sequence, for instance (e.g. max_gen_length = len(tokenizer.tokenize(input_seq)) + 20)?

It looks like you are looking for max_new_tokens?

2 Likes

Hi, I’m using the input parameter "past_key_values" to train a GPT model. So I wonder: when doing batch generation in this way, if I pass "past_key_values" to the model through "model_kwargs", will the generation method work as expected?
Thanks!

1 Like

Hello,

Correct me if I’m wrong, but for GPT2 in training mode, when padding_side='left', position_ids should be passed explicitly, or else the position_ids are created as if right padding were used … making them inconsistent, right?

I’m referring to this line where the position_ids are created if they are not passed …
transformers/modeling_gpt2.py at main ¡ huggingface/transformers (github.com)

Thank you !

Hey @lqtrung :wave: The position_ids don’t need to be passed, as long as the right attention_mask is. prepare_inputs_for_generation (see here) takes care of that for you :smiley:
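
For reference, a rough paraphrase of how those position_ids get built from the attention_mask (the exact code in prepare_inputs_for_generation may differ between versions):

```python
import torch

# Left-padded row from the example above: <pad> <pad> a b c
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])

# Count only the real tokens; padded slots get a dummy value
# (they are masked out anyway, so their position does not matter).
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(position_ids)  # tensor([[1, 1, 0, 1, 2]]) -> the real tokens get positions 0, 1, 2
```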

Hi @joaogante , thank you for the response.

I believe that the position_ids are properly prepared during generation, as you said, because prepare_inputs_for_generation is called …

But my question is about training, where that function is not called and the GPT2 modeling script does not compute position_ids based on the attention mask (so they are not correct when 'left' padding is used …)

So I’m not sure about the recommended practice:

  1. Is 'right' padding always used during training … and 'left' padding only used during batch generation?
  2. Or should training and generation use the same padding scheme, in which case the gpt2 modeling script should handle the position_ids better?

@lqtrung what you described as option 1. (right padding during training, left padding during inference) is the way to go.

You can also always pass position_ids, but the settings above get you the correct results without passing them. A caveat here is that you never want GPT2 to generate after its pad token (note: GPT2 doesn’t have a pad token, but it is common to set pad token = eos token), even if you pass the correct position_ids. GPT2 was not trained for that case, and the results will be gibberish – right padding will often get you in this situation.
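
If you ever do want to pass them yourself (e.g. a left-padded forward pass outside of generate()), here is a minimal sketch, assuming the usual pad-token-as-EOS setup and a placeholder checkpoint:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = GPT2LMHeadModel.from_pretrained("gpt2")

batch = tokenizer(
    ["a short prompt", "a somewhat longer prompt here"],
    return_tensors="pt",
    padding=True,
)

# Same cumsum trick generation uses: positions count only the real tokens.
position_ids = batch["attention_mask"].long().cumsum(-1) - 1
position_ids.masked_fill_(batch["attention_mask"] == 0, 1)

# Keep pad positions out of the loss by setting their labels to -100.
labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)

outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    position_ids=position_ids,
    labels=labels,
)
print(outputs.loss)
```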

A good resource to reason about this is the illustrated GPT2 :slight_smile:

The code pads all inputs to the same length, i.e. the length of the longest input. However, it’s important to note that during inference, the output lengths of different inputs can vary.

After testing the code with nine sentences, with input lengths ranging from 32 to 1140 tokens, I observed that model.generate(inputs_padded) completed after only one forward pass (one decoding step). This suggests that the model didn’t perform decoding correctly.

An additional attempt was made using model.generate(inputs_padded, max_new_tokens=64). However, this resulted in a CUDA-related error, probably because some sentences have completed generation while others have not.

Any suggestions to solve these problems?