Hello, I have a question about the documentation here (Generation with LLMs). Below is a code block, and I'm curious why setting padding_side to "left" yields the correct inference result, while setting it to "right" does not work. The attention_mask is also passed to the model's generate method, so theoretically, it should be able to correctly infer the next token.
# The tokenizer initialized above has right-padding active by default: the 1st sequence,
# which is shorter, has padding on the right side. Generation fails to capture the logic.
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# With left-padding, it works as expected!
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
LLMs are decoder-only architectures, meaning they continue to iterate on your input prompt. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded.
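To make that concrete, here is a small sketch (reusing the same checkpoint as the snippet above; any causal-LM tokenizer with a pad token set would behave the same way) that just prints the padded batch for both padding sides, without running generation:

from transformers import AutoTokenizer

prompts = ["1, 2, 3", "A, B, C, D, E"]

for side in ["right", "left"]:
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side=side)
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token, as in the docs
    batch = tokenizer(prompts, padding=True, return_tensors="pt")
    print(f"padding_side={side}")
    print("input_ids:     ", batch["input_ids"])
    print("attention_mask:", batch["attention_mask"])

# With right padding, the final position of the shorter prompt is a pad token;
# with left padding, the final position of every row is a real prompt token,
# which is exactly the position generation continues from.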
Hi @nielsr, thanks for your reply. I understand the role of padding; the point that actually confused me was why right padding affects the output of the model. Since the attention mask has already been passed in, the padding should be masked out in the attention weights, so theoretically it shouldn't have an effect.
@nielsr thanks for your help. After debugging the code, I found the key to the unexpected behavior (padding_side="right") is that the next_token comes from the logits of the pad token. I thought it would somehow take the logits of the last non-pad token as the predicted next token, but that's not actually the case; it simply takes the last token in the sequence (which could be a pad token).
while True:
    if synced_gpus:
        # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
        # The following logic allows an early break if all peers finished generating their sequence
        this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
        # send 0.0 if we finished, 1.0 otherwise
        dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
        # did all peers finish? the reduced sum will be 0.0 then
        if this_peer_finished_flag.item() == 0.0:
            break

    # prepare model inputs
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

    # forward pass to get next token
    outputs = self(
        **model_inputs,
        return_dict=True,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )

    if synced_gpus and this_peer_finished:
        continue  # don't waste resources running the code we don't need

    next_token_logits = outputs.logits[:, -1, :]
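In other words, generate() always reads the logits at position -1. Just to illustrate the difference, here is a minimal, hypothetical sketch of what "taking the last non-pad token" would look like on the first forward pass over a right-padded prompt; transformers does not do this during generation:

import torch

# Hypothetical: for a right-padded batch the attention mask is 1...1 followed by 0...0,
# so the index of the last real (non-pad) token is the number of real tokens minus one.
attention_mask = model_inputs["attention_mask"]                 # (batch, seq_len)
last_real_pos = attention_mask.sum(dim=1) - 1                   # (batch,)
batch_idx = torch.arange(attention_mask.size(0), device=attention_mask.device)
last_real_logits = outputs.logits[batch_idx, last_real_pos, :]  # (batch, vocab_size)

# generate() instead uses outputs.logits[:, -1, :], so with right padding the
# "next token" for the shorter prompt is predicted from a pad position.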
If a model doesn't have a padding token set, one can use the EOS token as the padding token and pad from the left at inference time.
This is not an issue since the model will then see "<eos> <eos> <eos> (…) hello your name is" => the model is prompted to continue from the token "is", so it will generate new tokens until it produces an EOS token.
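You can see that layout directly with a small sketch (assuming a tokenizer whose pad token has been set to its EOS token, e.g. the Mistral tokenizer from the snippets above; the second, longer prompt is only there to force padding of the first one):

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["hello your name is", "a deliberately longer prompt that forces the first one to be padded"],
    padding=True,
    return_tensors="pt",
)
# Decoding the first row shows the leading EOS pad tokens followed by the real prompt,
# e.g. something like "</s></s> ... <s> hello your name is" (exact special tokens depend on the model).
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["attention_mask"][0])  # 0 for the pad positions, 1 for the real prompt tokens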
Is it like [EOS, EOS, EOS, Hello, your, name, is, …]? Because in this format, the model should stop since it sees the stop token. What am I missing?
I didn't understand: what is the specific reason to use EOS for padding? Why are we using EOS, and why on the left side? Isn't it the case that the model stops when it sees the EOS token it generated itself (for example, [BOS] Hi, how are you? [EOS])? For this example, shouldn't the model just stop, since an [EOS] token follows "?" once it is tokenized?
It makes sense to use the EOS token when we set padding_side = right. Likewise, we could also use BOS (beginning of sentence) tokens for padding, right? And that would make sense when we set padding_side = left. What am I missing?
@DoganK01 From what I understand, what happens is that the model sees:
[eos] - nothing to generate
[eos] [eos] - nothing to generate
[eos] [eos] hello - generates logits for after hello
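You can check this end to end with a sketch (reusing the model and tokenizer names from the earlier snippets): left-pad with EOS and generation still continues past the prompt, only stopping once the model itself emits a new EOS, because the leading EOS tokens are masked out and the stopping check only looks at newly generated tokens.

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token   # the leading EOS tokens act purely as padding

model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
# The pad positions have attention_mask == 0, so the model never attends to them,
# and the EOS stopping check in generate() only applies to tokens the model generates itself.
generated_ids = model.generate(**model_inputs, max_new_tokens=10)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))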