Batch generation with GPT2

How to do batch generation with the GPT2 model?

1 Like

Batch generation is now possible for GPT2 in master by leveraging the functionality shown in this PR: https://github.com/huggingface/transformers/pull/7552 .

For more info on how to prepare GPT2 for batch generation, you can check out this test:

4 Likes

Hi, I am the author of the PR.

You can now do batch generation by calling the same generate() method.
All you need to add is:

  1. set tokenizer.padding_side = "left" (and remember to reset it afterwards)
  2. pass attention_mask to generate()

Explanation: (see full example in the end)

  1. We need tokenizer.padding_side = "left" because we will use the logits of the right-most token to predict the next token, so the padding should be on the left.
  2. This is what the PR added. Here is a summary:

GPT-2 uses absolute positional embeddings (position_ids). Before this change, no position_ids were passed to the model, and the model automatically generated them from 0 to n, even if there was padding (e.g. when the input is a batch).

Example: tokens = <pad> <pad> a b c -> position_ids = 0 1 2 3 4, whereas what we expect is x x 0 1 2 (x means don’t care)

This PR adds position_ids in prepare_inputs_for_generation(), which is called in generate(), by computing them from the attention_mask, and that’s why you need to pass it in.

You can find a full example in the PR.
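
For convenience, here is a minimal sketch along those lines (the checkpoint name, prompts, and generation length are just placeholders):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT2 has no pad token, so a common choice is to reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# 1. Pad on the left so the right-most token of every row is a real token.
tokenizer.padding_side = "left"

sentences = ["Hello, my dog is", "Today the weather"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

# 2. Pass the attention_mask so position_ids can be computed correctly.
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=20,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```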

7 Likes

Hi there. Thanks for your work to support batch inference in GPT2. However, I still have one point of confusion, which may need your help. Thanks in advance!
If I want to pass past_key_values, how should I process the position_ids and attention mask? Suppose the length of my past_key_values is 2 and the padded input is just like your example: <pad>, <pad>, a, b, c. Should I change the attention mask from 0, 0, 1, 1, 1 to 1, 1, 0, 0, 1, 1, 1, where the first double "1" refers to the past_key_values?
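
To make the shapes concrete, here is roughly what I have in mind (placeholder token ids, purely to illustrate the question):

```python
import torch

# Hypothetical shapes only; pad_id / a_id / b_id / c_id are placeholder ids.
pad_id, a_id, b_id, c_id = 50256, 64, 65, 66
input_ids = torch.tensor([[pad_id, pad_id, a_id, b_id, c_id]])  # current tokens, shape (1, 5)

# past_key_values is assumed to cover 2 earlier real tokens, so the mask
# would cover past + current positions, shape (1, 2 + 5) = (1, 7):
# ones for the 2 past tokens, zeros for the pads, ones for a, b, c.
attention_mask = torch.tensor([[1, 1, 0, 0, 1, 1, 1]])
```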
Thanks a lot!

@patrickvonplaten @ttj I think this is a good question! Could we discuss how to do batch inference with past_key_values?

Is it possible to have a variable max_gen_length, depending on the length of the input sequence, for instance (e.g. max_gen_length = len(tokenizer.tokenize(input_seq)) + 20)?

It looks like you are looking for max_new_tokens?

2 Likes

Hi, I’m using the input parameter "past_key_values" to train a GPT model. So I wonder: when doing batch generation in this way, if I pass "past_key_values" to the model through "model_kwargs", will the generation method work as expected?
Thanks!

1 Like

Hello,

Correct me if I’m wrong, but for GPT2 in training mode, when padding_side='left', position_ids should be passed explicitly, or else the position_ids are created as if right padding were used … making them inconsistent, right?

I’m referring to this line where the position_ids are created if they are not passed …
transformers/modeling_gpt2.py at main ¡ huggingface/transformers (github.com)

Thank you !

Hey @lqtrung :wave: The position_ids don’t need to be passed, as long as the right attention_mask is. prepare_inputs_for_generation (see here) takes care of that for you :smiley:
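
For reference, a rough paraphrase of how those position_ids get built from the attention_mask (the exact code in prepare_inputs_for_generation may differ between versions):

```python
import torch

# Left-padded row from the example above: <pad> <pad> a b c
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])

# Count only the real tokens; padded slots get a dummy value
# (they are masked out anyway, so their position does not matter).
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(position_ids)  # tensor([[1, 1, 0, 1, 2]]) -> the real tokens get positions 0, 1, 2
```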

Hi @joaogante , thank you for the response.

I believe that the position_ids are properly prepared during generation, as you said, because prepare_inputs_for_generation is called …

But my question is about training, where that function is not called and the GPT2 modeling script does not compute position_ids based on the attention mask (so they are not correct when 'left' padding is used …)

So I’m not sure about the recommended practice:

  1. Is 'right' padding always used during training … and 'left' padding only used during batch generation?
  2. Or should training and generation use the same padding scheme, in which case the gpt2 modeling script should handle the position_ids better?

@lqtrung what you described as option 1. (right padding during training, left padding during inference) is the way to go.

You can also always pass position_ids, but the settings above get you the correct results without passing them. A caveat here is that you never want GPT2 to generate after its pad token (note: GPT2 doesn’t have a pad token, but it is common to set pad token = eos token), even if you pass the correct position_ids. GPT2 was not trained for that case, and the results will be gibberish – right padding will often get you in this situation.
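
If you ever do want to pass them yourself (e.g. a left-padded forward pass outside of generate()), here is a minimal sketch, assuming the usual pad-token-as-EOS setup and a placeholder checkpoint:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = GPT2LMHeadModel.from_pretrained("gpt2")

batch = tokenizer(
    ["a short prompt", "a somewhat longer prompt here"],
    return_tensors="pt",
    padding=True,
)

# Same cumsum trick generation uses: positions count only the real tokens.
position_ids = batch["attention_mask"].long().cumsum(-1) - 1
position_ids.masked_fill_(batch["attention_mask"] == 0, 1)

# Keep pad positions out of the loss by setting their labels to -100.
labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)

outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    position_ids=position_ids,
    labels=labels,
)
print(outputs.loss)
```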

A good resource to reason about this is the illustrated GPT2 :slight_smile:

The code pads all inputs to the same length, i.e. the length of the longest input. However, it’s important to note that during inference, the output lengths of different inputs can vary.

After testing the code with nine sentences, with input lengths ranging from 32 to 1140 tokens, I observed that model.generate(inputs_padded) completed after only one forward pass (one decoding step). This suggests that the model didn’t perform decoding correctly.

An additional attempt was made using model.generate(inputs_padded, max_new_tokens=64). However, this resulted in a CUDA-related error, probably because some sentences have completed generation while others have not.

Any suggestions to solve these problems?