How to do batch generation with the GPT2 model?
Batch generation is now possible for GPT2 in master by leveraging the functionality shown in this PR: https://github.com/huggingface/transformers/pull/7552
For more info on how to prepare a GPT2 for batch generation, you can check out this test:
Hi, I am the author of the PR.
You can now do batch generation by calling the same `generate()`. All you need to add is:
- set `tokenizer.padding_side = "left"` (probably reset it back later)
- pass in the `attention_mask` to `generate()`
Explanation (see the full example at the end):
- We need `tokenizer.padding_side = "left"` because we will use the logits of the right-most token to predict the next token, so the padding should be on the left.
- This is what this PR added. Here is a summary:
GPT-2 uses absolute positional embeddings (`position_ids`). Before this change, no `position_ids` were passed to the model, and the model automatically generated them from 0 to n, even if there was padding (e.g. when the input is a batch).
Example: `tokens = <pad> <pad> a b c` -> `position_ids = 0 1 2 3 4`, whereas what we expect is `x x 0 1 2` (`x` means don't care).
This PR adds the `position_ids` computation in `prepare_inputs_for_generation()`, which is called in `generate()`, by calculating them from the `attention_mask`, and that's why you need to pass it in.
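Roughly, that computation can be sketched as follows (a simplified illustration of the idea, not a copy of the library code):

```python
import torch

# Left-padded batch: 0 marks padding, 1 marks real tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# Each real token gets its position via a cumulative sum; padded positions
# receive a dummy value (they are masked out anyway).
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```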
You can find a full example in the PR.
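For convenience, here is a minimal self-contained sketch of the setup described above (the model name, prompts, and generation length are placeholders, not taken from the PR):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has no pad token; reusing the EOS token is a common workaround.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # pad on the left so the right-most token is a real one

sentences = ["Hello, my dog is", "Batch generation with GPT2 works by"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # needed so position_ids are derived correctly
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```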
Hi there. Thanks for your work to support batch inference in GPT2. However, I still have one point of confusion, which may need your help. Thanks in advance!
If I want to pass `past_key_values`, how should I process the `position_ids` and attention mask? Suppose the length of my `past_key_values` is 2 and the padded input is just like your example: `<pad> <pad> a b c`. Should I change the attention mask from `0, 0, 1, 1, 1` to `1, 1, 0, 0, 1, 1, 1`, where the first two '1's refer to the `past_key_values`?
Thanks a lot!
@patrickvonplaten @ttj I think this is a good question! Could we discuss how to do batch inference with `past_key_values`?
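For reference, a small sketch of the shape convention I believe applies here (the GPT-2 docs state that when `past_key_values` is used, the `attention_mask` must also cover the cached past, i.e. have length `past_length + input_length`; the values below just mirror the example in the question and are not a confirmed answer from the maintainers):

```python
import torch

past_length = 2  # number of positions already cached in past_key_values
new_mask = torch.tensor([[0, 0, 1, 1, 1]])  # mask for the new, left-padded tokens

# Prepend ones for the cached positions so the mask covers past + current tokens.
past_mask = torch.ones((new_mask.shape[0], past_length), dtype=new_mask.dtype)
full_attention_mask = torch.cat([past_mask, new_mask], dim=-1)

print(full_attention_mask)  # tensor([[1, 1, 0, 0, 1, 1, 1]])
```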
Is it possible to have a variable `max_gen_length`, depending on the length of the input sequence, for instance? (e.g. `max_gen_length = len(tokenizer.tokenize(input_seq)) + 20`)
It looks like you are looking for `max_new_tokens`?
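For illustration, `max_new_tokens` counts only the newly generated tokens, so the total output length scales with the prompt automatically (a minimal sketch; the prompt and the value 20 are placeholders):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("A prompt of any length", return_tensors="pt")

# Generates at most 20 tokens on top of the prompt, whatever the prompt length is.
outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```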
Hi, I'm using the input parameter `past_key_values` to train a GPT model. So I wonder: when doing batch generation in this way, if I pass `past_key_values` to the model through the `model_kwargs` parameter, will the generation method work as expected?
Thx!
Hello,
Correct me if I'm wrong, but for GPT2, in training mode, when `padding_side='left'`, `position_ids` should be passed explicitly, or else the `position_ids` are created as if right padding were used, making them inconsistent, right?
I'm referring to this line where the `position_ids` are created if they are not passed:
transformers/modeling_gpt2.py at main · huggingface/transformers (github.com)
Thank you!
Hey @lqtrung The `position_ids` don't need to be passed, as long as the right `attention_mask` is. `prepare_inputs_for_generation` (see here) takes care of that for you.
Hi @joaogante, thank you for the response.
I believe that the `position_ids` are properly prepared during generation, as you said, because `prepare_inputs_for_generation` is called.
But my question is about training, where that function is not called and the GPT2 modeling script does not compute `position_ids` based on the attention mask (so they are not correct when "left" padding is used).
So I'm not sure about the recommended practice:
- Is "right" padding always used during training, and "left" padding only used during batch generation?
- Or should training and generation use the same padding scheme, in which case the GPT2 modeling script should handle the `position_ids` better?
@lqtrung What you described as option 1 (right padding during training, left padding during inference) is the way to go.
You can also always pass `position_ids`, but the settings above get you the correct results without passing them. A caveat here is that you never want GPT2 to generate after its pad token (note: GPT2 doesn't have a pad token, but it is common to set pad token = eos token), even if you pass the correct `position_ids`. GPT2 was not trained for that case, and the results will be gibberish; right padding will often get you into this situation.
A good resource to reason about this is The Illustrated GPT-2.
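As a concrete illustration of that split, a minimal sketch with placeholder texts (the training loop itself is omitted):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token of its own

texts = ["a short example", "a somewhat longer training example"]

# Training: right padding, so positions 0..n-1 line up with the real tokens.
tokenizer.padding_side = "right"
train_batch = tokenizer(texts, return_tensors="pt", padding=True)

# Inference: left padding, so the right-most token of every row is a real token
# and generate() appends directly after it.
tokenizer.padding_side = "left"
gen_batch = tokenizer(texts, return_tensors="pt", padding=True)
```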
The code is designed to pad all inputs to the same length, namely the maximum length among the inputs. However, it's important to note that during inference, the output lengths of different inputs can vary.
After testing the code with nine sentences, where the input lengths range from 32 to 1140, it was observed that `model.generate(inputs_padded)` completed after only one forward pass (decoding). This suggests that the model didn't perform decoding correctly.
An additional attempt was made using `model.generate(inputs_padded, max_new_tokens=64)`. However, this resulted in a CUDA-related error, probably because some sentences have completed generation while others have not.
Any suggestions to solve these problems?