I am trying to batch-generate text, 16 sequences at a time. While tokenizing, I left-pad all my sequences and set the pad_token equal to the eos_token. Since I don’t see any link between the generate() method and the tokenizer that was used to tokenize the input, how do I set this up?
Here is a small code snippet of what I am trying to do:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.padding_side = "left"  # pad prompts on the left so generation continues from the real tokens
tokenizer.pad_token = tokenizer.eos_token # to avoid an error
model = GPT2LMHeadModel.from_pretrained('gpt2')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)  # keep the model on the same device as the inputs
texts = ["this is a first prompt", "this is a second prompt"]
encoding = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    generated_ids = model.generate(**encoding)
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
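As a sanity check on the tokenizer side, the padded encoding can be printed directly; with the settings above, the shorter prompt should get the eos token id on the left and zeros in the corresponding attention_mask positions:

# Sanity check: with padding_side="left" and pad_token = eos_token, shorter
# prompts are padded on the left with the eos id and masked out.
print(encoding.input_ids)       # left-padded with tokenizer.eos_token_id
print(encoding.attention_mask)  # 0 on the padding positions, 1 elsewhere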
Two generated sequences from prompts of unequal length currently look like this:
[[2, 2, 2, 2, 323, 244, 34, 55, 1, 1, 1, 1],
 [2, 2, 23, 3225, 323, 244, 34, 55, 51, 61, 41, 612]]
Here, 1 is the <unk> token and 2 is the eos_token.
Here is the current behavior when we batch generate:
- The generated text is padded on the right.
- The padding token used by generate() is the <unk> token of the model’s tokenizer.
I want to be able to:
- Change the padding side to left.
- Use a different token for padding, for instance the eos_token that I already set the tokenizer to use in the snippet above (my current guess is sketched below).
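My current guess, and I’m not sure it is the intended way, is that the link has to be made explicitly when calling generate(), for example by passing the tokenizer’s eos id as pad_token_id (the attention_mask is already forwarded through **encoding):

with torch.no_grad():
    # Guess: tell generate() to pad with the same token the tokenizer uses;
    # the attention_mask in `encoding` marks the left-padding positions.
    generated_ids = model.generate(**encoding, pad_token_id=tokenizer.eos_token_id)
generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Is this the right way to tie the two together, and does it also control which side generate() pads on?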