The effect of padding_side

Hello, I have a question about the documentation here (Generation with LLMs). Below is a code block, and I’m curious why setting padding_side to ‘left’ yields the correct inference result, while setting it to ‘right’ does not. The attention_mask is also passed to the model’s generate method, so in theory it should still be able to infer the next token correctly.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")

# The tokenizer initialized above has right-padding active by default: the 1st sequence,
# which is shorter, has padding on the right side. Generation fails to capture the logic.
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# With left-padding, it works as expected!
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

Hi,

This is explained here: Generation with LLMs.

LLMs are decoder-only architectures, meaning they keep iterating on your input prompt, continuing it one token at a time. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded, so that the real prompt ends at the last position.
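
To make this concrete, here is a minimal sketch of what the two padding sides feed to the model (using GPT-2's tokenizer as a small stand-in, since it also lacks a pad token; the doc example uses Mistral-7B):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer; no pad token by default
tokenizer.pad_token = tokenizer.eos_token
prompts = ["1, 2, 3", "A, B, C, D, E"]

tokenizer.padding_side = "right"
right = tokenizer(prompts, padding=True, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(right["input_ids"][0].tolist()))
# [..., 'Ġ3', '<|endoftext|>', '<|endoftext|>', ...] -> the shorter prompt now *ends* in padding,
# yet generation always continues from the last position.

tokenizer.padding_side = "left"
left = tokenizer(prompts, padding=True, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(left["input_ids"][0].tolist()))
# ['<|endoftext|>', ..., 'Ġ3'] -> the last position is now the real end of the prompt.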

3 Likes

Hi @nielsr, thanks for your reply. I understand the role of padding; what actually confused me is why right padding affects the model's output. Since the attention mask is passed in, the pad tokens should be masked out in the attention weights, so in theory they shouldn't have any effect.

@nielsr thanks for your help. After debugging the code, I found that the key to the unexpected behavior with padding_side="right" is that next_token comes from the logits at a pad-token position. I thought generate would somehow use the logits of the last non-pad token to predict the next token, but that's not actually the case: it simply takes the last position, which can be a pad token. The relevant part of the generation loop is below, followed by a toy illustration.

        while True:
            if synced_gpus:
                # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
                # The following logic allows an early break if all peers finished generating their sequence
                this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
                # send 0.0 if we finished, 1.0 otherwise
                dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
                # did all peers finish? the reduced sum will be 0.0 then
                if this_peer_finished_flag.item() == 0.0:
                    break

            # prepare model inputs
            model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

            # forward pass to get next token
            outputs = self(
                **model_inputs,
                return_dict=True,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
            )

            if synced_gpus and this_peer_finished:
                continue  # don't waste resources running the code we don't need

            next_token_logits = outputs.logits[:, -1, :]
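
To see why that last line is the problem, here is a toy sketch (plain tensors, not Hugging Face code) of what indexing position -1 does when the batch is right-padded:

import torch

# Batch of 2 prompts padded to length 5; 0 in the attention mask marks a pad position.
attention_mask = torch.tensor([
    [1, 1, 1, 0, 0],  # "1, 2, 3" right-padded with two pad tokens
    [1, 1, 1, 1, 1],  # "A, B, C, D, E"
])
# Pretend per-position logits over a tiny vocabulary of size 4 (values are arbitrary).
logits = torch.randn(2, 5, 4)

next_token_logits = logits[:, -1, :]
# For row 0, position -1 is a pad token, so its first "generated" token is predicted
# from the hidden state sitting on top of padding rather than on top of "3".
# With left padding, position -1 is the last real prompt token for every row,
# so the same indexing works for the whole batch.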
7 Likes

Hi dude, I couldn't quite understand the logic here.

And one more thing: I saw in the code above that it was decided to pad on the left side, but with the EOS token? Don't the models automatically stop when they see EOS tokens? Shouldn't there be a problem here?

1 Like

Hi,

If models don't have a padding token set, one can use the EOS token as the padding token and pad from the left at inference time.

This is not an issue, since the model then sees "<eos> <eos> <eos> (…) hello your name is". The model is prompted to continue from the token "is", so it will keep generating new tokens until it produces an EOS token itself.
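
As a small sketch of this (with GPT-2 as a stand-in model, since it is tiny to download; the doc example uses Mistral-7B): the EOS tokens used for padding are marked as 0 in the attention mask, and generate() only stops once the model itself emits an EOS after the prompt, or a length limit is hit:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(
    ["hello your name is", "the quick brown fox jumps over the lazy"],
    padding=True, return_tensors="pt"
)
print(batch["attention_mask"][0])  # 0s over the left EOS padding, 1s over the real prompt

# The padded EOS tokens are ignored through the mask, so generation happily continues:
out = model.generate(**batch, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])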

2 Likes

Is it like [EOS, EOS, EOS, Hello, your, name, is, …]? Because in this format, shouldn't the model stop since it sees the stop token? What am I missing?

1 Like

Yes, sorry, the forum was hiding the <eos> tokens in my reply :stuck_out_tongue:

I didn't understand. What is the specific reason to use EOS for padding? Why are we using EOS, and why on the left side? Isn't it the case that the model stops when it sees an EOS token it generated itself (for example, [BOS] Hi, how are you? [EOS])? For this example, shouldn't the model just stop, since an [EOS] token appears right after "?" in the tokenized text?

It would make sense to use the EOS token when we set padding_side = "right". Likewise, we could also use BOS (beginning of sentence) tokens for padding, right? And that would make sense when we set padding_side = "left". What am I missing?

1 Like

@DoganK01 from what I understand, what happens is that the model sees:
[eos] - nothing to generate
[eos] [eos] - nothing to generate
[eos] [eos] hello - generates logits for what comes after "hello"

hope this clears it up for you!

1 Like

I cannot understand why Hugging Face implemented it like this. Why don't they extract the logits of the last non-pad token of each sample?
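
For reference, "extract the last non-pad token of each sample" would look roughly like the hypothetical helper below (this is not what transformers does, just a sketch of the idea using the attention mask):

import torch

def last_non_pad_logits(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: for each row, pick the logits computed at the last position
    # whose attention-mask value is 1 (i.e. the last real, non-pad token).
    last_idx = attention_mask.sum(dim=1) - 1                     # index of last real token per row
    batch_idx = torch.arange(logits.size(0), device=logits.device)
    return logits[batch_idx, last_idx, :]

Even with such a helper, the token generated for a right-padded row would still have to be appended after the pad tokens, leaving a gap between the prompt and its continuation; left padding avoids that entirely, which is presumably why the docs simply recommend it for generation.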