How to set the padding configuration with Hugging Face GenerationMixin's generate method?

I am trying to batch-generate text 16 sequences at a time. While tokenizing, I left-pad all my sequences and set the pad_token equal to the eos_token. Since I don’t see a link between the generate method and the tokenizer used to tokenize the input, how do I set this up?

Here is a small code snippet of what I am trying to do:

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.padding_side = "left" 
tokenizer.pad_token = tokenizer.eos_token # to avoid an error
model = GPT2LMHeadModel.from_pretrained('gpt2')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)  # move the model to the same device as the encoded inputs

texts = ["this is a first prompt", "this is a second prompt"]
encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    generated_ids = model.generate(**encoding)
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Two sequences of unequal length currently look like this:

[[2, 2, 2, 2, 323, 244, 34, 55, 1, 1, 1, 1],
 [2, 2, 23, 3225, 323, 244, 34, 55, 51, 61, 41, 612]]

Here, 1 is the <unk> token and 2 is the eos_token.

Here is the current behavior when we batch generate:

  • The generated text is padded on the right.
  • The padding token used by generate() is the <unk> token of the model’s tokenizer.

I want to be able to:

  • Change the padding side to left.

  • Use a different token for padding. For instance the eos_token that I already set the tokenizer to use in the snippet above.

I think your snippet should already work correctly. Are you seeing any errors?

No, I do not face errors in the above snippet. I will edit the question to better explain what I want.

  • Use a different token for padding. For instance the eos_token that I already set the tokenizer to use in the snippet above.

I see. You can pass pad_token_id to the generate call to do this. For example:

generated_ids = model.generate(**encoding, pad_token_id=tokenizer.eos_token_id)

Or you can set it in the model’s generation config:

model.generation_config.pad_token_id = tokenizer.eos_token_id
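
Note that passing pad_token_id as a kwarg only affects that single generate() call, while setting it on model.generation_config makes it the default for every subsequent call.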

As for this part:

  • Change the padding side to right.

I’m still a bit confused about what you’re asking. During generation, the padding is already on the right in the outputs you posted. Or you’re saying you want the tokenizer to pad on the right?


As for this part:

  • Change the padding side to right.

I’m still a bit confused about what you’re asking. During generation, the padding is already on the right in the outputs you posted. Or you’re saying you want the tokenizer to pad on the right?

Oops, I meant left.

Morning grogginess, I guess. I am editing the original post now.

Finally, would I be right to say that the args the tokenizer generally takes for padding (the padding side, the length to pad to, maybe truncation, although I do not see the utility of that, etc.) can all also be passed as args to the generate() method?

I don’t think there’s any way to do this - if you take a look here, you see that if a sequence in the batch is finished, the HF code will set the next token to the pad token and then concatenate it to the end of the finished sequence (in other words, pad on the right).

However, this is totally fine. When you de-tokenize the outputs, the tokenizer will remove all the pad tokens regardless of which side they’re on. Is there a reason you want the pad tokens created during generation to be on the left (except perhaps for consistency with the tokenizer)?
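
If it helps to see that concretely, here is a minimal sketch reusing the tokenizer from your snippet (the three extra pad tokens are added by hand just for the demonstration):

import torch

pad_id = tokenizer.eos_token_id  # also the pad token id in your setup
ids = tokenizer("this is a first prompt", return_tensors='pt').input_ids[0]

left_padded = torch.cat([torch.full((3,), pad_id), ids])
right_padded = torch.cat([ids, torch.full((3,), pad_id)])

# skip_special_tokens=True drops the pad/eos tokens on either side,
# so both sequences decode back to the same text
print(tokenizer.decode(left_padded, skip_special_tokens=True))
print(tokenizer.decode(right_padded, skip_special_tokens=True))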

  Finally, would I be right to say that the args the tokenizer generally takes for padding (the padding side, the length to pad to, maybe truncation, although I do not see the utility of that, etc.) can all also be passed as args to the generate() method?

No, not all of them. Just pad_token_id, bos_token_id, and eos_token_id. Kwargs passed to generate get passed along to update the GenerationConfig object (you can see the class here and the args it can take). The GenerationConfig only has a few tokenizer-related args.
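
To make that concrete, here is a small sketch of two equivalent ways of supplying those IDs (the max_new_tokens value is arbitrary; tokenizer, model, and encoding are as in the snippet above):

from transformers import GenerationConfig

# kwargs passed to generate() override the model's GenerationConfig for that call
out = model.generate(**encoding, pad_token_id=tokenizer.eos_token_id, max_new_tokens=20)

# equivalently, build a GenerationConfig explicitly and pass it in
gen_config = GenerationConfig(
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=20,
)
out = model.generate(**encoding, generation_config=gen_config)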

I see. Part of why I’d like such a feature is maybe completeness (as you drew the parallel to the tokenizer class). But I can also imagine the case where you are calling generate within a loop and iteratively feeding the generated sequences from the previous step as input to the next. The only way to do this right now is to batch_decode and then re-tokenize the sequences with left padding (since padding needs to be on the left when batch-generating), as in the sketch below.
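
For reference, a rough sketch of that loop as it has to be written today (the number of rounds and max_new_tokens are arbitrary; tokenizer, model, and device are as in the snippet above):

texts = ["this is a first prompt", "this is a second prompt"]
for _ in range(3):  # arbitrary number of continuation rounds
    # re-tokenize with left padding, since generate() pads finished sequences on the right
    encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        generated_ids = model.generate(
            **encoding,
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=20,
        )
    # decode, dropping pad tokens on either side, and feed the results back in
    texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)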

PS: maybe a method like prep_for_batch_generation on the AutoTokenizer class would be nice.

Much love,
Nasheed