How to set the padding configuration with Hugging Face GenerationMixin's generate method?

I am trying to batch-generate text 16 sequences at a time. While tokenizing, I left-pad all my sequences and set the pad_token equal to the eos_token. Since I don’t see a link between the generate method and the tokenizer used to tokenize the input, how do I set this up?

Here is a small code snippet of what I am trying to do:

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.padding_side = "left" 
tokenizer.pad_token = tokenizer.eos_token # to avoid an error
model = GPT2LMHeadModel.from_pretrained('gpt2')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)  # move the model to the same device as the encoded inputs

texts = ["this is a first prompt", "this is a second prompt"]
encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    generated_ids = model.generate(**encoding)
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Two sequences of unequal length currently look like this:

[[2, 2, 2, 2, 323, 244, 34, 55, 1, 1, 1, 1],
 [2, 2, 23, 3225, 323, 244, 34, 55, 51, 61, 41, 612]]

Here, 1 is the <unk> token and 2 is the eos_token.

Here is the current behavior when we batch generate:

  • The generated text is padded on the right.
  • The padding token used by generate() is the <unk> token of the model’s tokenizer.

I want to be able to:

  • Change the padding side to left.

  • Use a different token for padding. For instance the eos_token that I already set the tokenizer to use in the snippet above.

I think your snippet should already work correctly. Are you seeing any errors?

No, I do not face errors in the above snippet. I will edit the question to better explain what I want.

  • Use a different token for padding. For instance the eos_token that I already set the tokenizer to use in the snippet above.

I see. You can pass pad_token_id to the generate call to do this. For example:

generated_ids = model.generate(**encoding, pad_token_id=tokenizer.eos_token_id)

Or you can set it in the model’s generation config:

model.generation_config.pad_token_id = tokenizer.eos_token_id
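
Note that passing pad_token_id as a kwarg only affects that single generate() call, while setting it on model.generation_config makes it the default for every subsequent call.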

As for this part:

  • Change the padding side to right.

I’m still a bit confused about what you’re asking. During generation, the padding is already on the right in the outputs you posted. Or you’re saying you want the tokenizer to pad on the right?


As for this part:

  • Change the padding side to right.

I’m still a bit confused about what you’re asking. During generation, the padding is already on the right in the outputs you posted. Or you’re saying you want the tokenizer to pad on the right?

Oops, I meant left.

Morning grogginess, I guess. I am editing the original post now.

Finally, would I be right to say that the args the tokenizer generally takes for padding (the padding side, the length to pad to, maybe truncation, although I do not see the utility of that, etc.) can all also be passed as args to the generate() method?

I don’t think there’s any way to do this - if you take a look here, you see that if a sequence in the batch is finished, the HF code will set the next token to the pad token and then concatenate it to the end of the finished sequence (in other words, pad on the right).

However, this is totally fine. When you de-tokenize the outputs, the tokenizer will remove all the pad tokens regardless of which side they’re on. Is there a reason you want the pad tokens created during generation to be on the left (except perhaps for consistency with the tokenizer)?
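
If it helps to see that concretely, here is a minimal sketch reusing the tokenizer from your snippet (the three extra pad tokens are added by hand just for the demonstration):

import torch

pad_id = tokenizer.eos_token_id  # also the pad token id in your setup
ids = tokenizer("this is a first prompt", return_tensors='pt').input_ids[0]

left_padded = torch.cat([torch.full((3,), pad_id), ids])
right_padded = torch.cat([ids, torch.full((3,), pad_id)])

# skip_special_tokens=True drops the pad/eos tokens on either side,
# so both sequences decode back to the same text
print(tokenizer.decode(left_padded, skip_special_tokens=True))
print(tokenizer.decode(right_padded, skip_special_tokens=True))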

  Finally, would I be right to say that the args the tokenizer generally takes for padding (the padding side, the length to pad to, maybe truncation, although I do not see the utility of that, etc.) can all also be passed as args to the generate() method?

No, not all of them. Just pad_token_id, bos_token_id, and eos_token_id. Kwargs passed to generate get passed along to update the GenerationConfig object (you can see the class here and the args it can take). The GenerationConfig only has a few tokenizer-related args.
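
To make that concrete, here is a small sketch of two equivalent ways of supplying those IDs (the max_new_tokens value is arbitrary; tokenizer, model, and encoding are as in the snippet above):

from transformers import GenerationConfig

# kwargs passed to generate() override the model's GenerationConfig for that call
out = model.generate(**encoding, pad_token_id=tokenizer.eos_token_id, max_new_tokens=20)

# equivalently, build a GenerationConfig explicitly and pass it in
gen_config = GenerationConfig(
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=20,
)
out = model.generate(**encoding, generation_config=gen_config)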

I see. Part of why I’d like such a feature is maybe completeness (as you drew the parallel to the tokenizer class). But I can also imagine the case where you are calling generate within a loop and iteratively feeding the generated sequences from the previous step as input to the next. The only way to do this right now is to batch_decode and then re-tokenize the sequences with left padding (since padding needs to be on the left when batch-generating), as in the sketch below.
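
For reference, a rough sketch of that loop as it has to be written today (the number of rounds and max_new_tokens are arbitrary; tokenizer, model, and device are as in the snippet above):

texts = ["this is a first prompt", "this is a second prompt"]
for _ in range(3):  # arbitrary number of continuation rounds
    # re-tokenize with left padding, since generate() pads finished sequences on the right
    encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        generated_ids = model.generate(
            **encoding,
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=20,
        )
    # decode, dropping pad tokens on either side, and feed the results back in
    texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)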

PS: maybe a method like prep_for_batch_generation on the AutoTokenizer class would be nice.

Much love,
Nasheed