The effect of padding_side

zhouzaida · December 27, 2023, 4:32pm

Hello, I have a question about the documentation here (Generation with LLMs). Below is a code block, and I’m curious why setting padding_side to ‘left’ yields the correct inference result, while setting it to ‘right’ does not work. The attention_mask is also passed to the model’s generate method, so theoretically, it should be able to correctly infer the next token.

# The tokenizer initialized above has right-padding active by default: the 1st sequence,
# which is shorter, has padding on the right side. Generation fails to capture the logic.
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# With left-padding, it works as expected!
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

nielsr · December 27, 2023, 7:56pm

Hi,

This is explained here: Generation with LLMs.

LLMs are decoder-only architectures, meaning they continue to iterate on your input prompt. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded.

zhouzaida · December 28, 2023, 2:14am

Hi @nielsr, thanks for your reply. I understand the role of padding. What actually confused me is why right padding affects the model's output: since the attention mask has already been passed in, the padding should be masked out in the attention weights, so in theory it shouldn't have any effect.

zhouzaida · December 28, 2023, 6:30am

@nielsr thanks for your help. After debugging the code, I found that the key to the unexpected behavior (padding_side='right') is that next_token comes from the logits at the pad token. I thought it would somehow use the logits at the last non-pad token to predict the next token, but that's not the case: it simply takes the last position (which could be a pad token).

        while True:
            if synced_gpus:
                # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
                # The following logic allows an early break if all peers finished generating their sequence
                this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
                # send 0.0 if we finished, 1.0 otherwise
                dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
                # did all peers finish? the reduced sum will be 0.0 then
                if this_peer_finished_flag.item() == 0.0:
                    break

            # prepare model inputs
            model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

            # forward pass to get next token
            outputs = self(
                **model_inputs,
                return_dict=True,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
            )

            if synced_gpus and this_peer_finished:
                continue  # don't waste resources running the code we don't need

            next_token_logits = outputs.logits[:, -1, :]
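To make the point about `outputs.logits[:, -1, :]` concrete, here is a minimal pure-Python sketch of which token sits in the last column under each padding side. The `pad_batch` helper and the token split are hypothetical (a real tokenizer splits the text differently); only the padding layout matters here.

```python
PAD = "<pad>"

def pad_batch(sequences, side):
    """Pad token lists to a common length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        gap = width - len(seq)
        if side == "left":
            padded.append([PAD] * gap + seq)
            masks.append([0] * gap + [1] * len(seq))
        else:
            padded.append(seq + [PAD] * gap)
            masks.append([1] * len(seq) + [0] * gap)
    return padded, masks

# Hypothetical tokenization of "1, 2, 3" and "A, B, C, D, E".
batch = [
    ["1", ",", "2", ",", "3"],
    ["A", ",", "B", ",", "C", ",", "D", ",", "E"],
]

right_ids, right_mask = pad_batch(batch, "right")
left_ids, left_mask = pad_batch(batch, "left")

# Indexing with -1 reads the prediction made at the final column of each row.
last_right = [row[-1] for row in right_ids]
last_left = [row[-1] for row in left_ids]

print(last_right)  # ['<pad>', 'E']
print(last_left)   # ['3', 'E']
```

With right padding, the shorter row's next token is predicted from a pad position even though the attention mask is correct. With left padding, every row ends on a real token.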

DoganK01 · May 17, 2024, 11:56pm

Hi dude, I couldn't quite understand the logic here.

And one more thing: I saw this piece of code:

It pads on the left side, but with the EOS token? Don't models automatically stop when they see an EOS token? Shouldn't that be a problem here?

nielsr · May 20, 2024, 9:42am

Hi,

If models don’t have a padding token set one can use the EOS token as padding token, and pad from the left at inference time.

This is not an issue, since the model will then see “<eos> <eos> <eos> (…) hello your name is”. The model is prompted to continue from the token “is”, so it will generate new tokens until it generates an EOS token.
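As a rough illustration of why EOS tokens used as left padding do not stop generation, here is a toy greedy loop. Everything in it is made up for the sketch: `toy_next_token` stands in for a real model and the `script` table is an invented continuation. The assumption it illustrates is that the stop check is applied to the token just generated, not to tokens already in the prompt.

```python
EOS = "<eos>"

def toy_next_token(tokens, mask):
    """Hypothetical stand-in for a model: it only looks at unmasked tokens."""
    visible = [tok for tok, keep in zip(tokens, mask) if keep]
    script = {"is": "Alice", "Alice": EOS}  # made-up continuation table
    return script.get(visible[-1], EOS)

def toy_generate(tokens, mask, max_new_tokens=5):
    """Greedy loop that stops once a newly generated token is EOS."""
    tokens, mask = list(tokens), list(mask)
    for _ in range(max_new_tokens):
        new_token = toy_next_token(tokens, mask)
        tokens.append(new_token)
        mask.append(1)
        # The check looks at the token just produced, never at the prompt.
        if new_token == EOS:
            break
    return tokens

prompt = [EOS, EOS, EOS, "hello", "your", "name", "is"]
mask = [0, 0, 0, 1, 1, 1, 1]  # the left padding is masked out

result = toy_generate(prompt, mask)
print(result)
# ['<eos>', '<eos>', '<eos>', 'hello', 'your', 'name', 'is', 'Alice', '<eos>']
```

The three leading EOS tokens are part of the input and are hidden by the attention mask. Only the EOS produced at the end terminates the loop.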

DoganK01 · May 20, 2024, 9:39pm

Is it like [EOS, EOS, EOS, Hello, your, name, is, … ]? Because in this format, the model should stop since it sees the stop token. What am I missing?

nielsr · May 21, 2024, 7:00am

Yes. Sorry, the forum was hiding the <eos> tokens in my reply.

DoganK01 · May 21, 2024, 11:37pm

I didn't understand. What is the specific reason for padding with EOS? Why are we using EOS, and why on the left side? Isn't it the case that the model stops when it sees an EOS token it generated itself (for example, [BOS] Hi, how are you? [EOS])? In this example, shouldn't the model just stop, since it generated the [EOS] token after “?”?

It makes sense to use the EOS token when we set the padding side to right. Likewise, we could also use BOS (beginning of sentence) tokens for padding, right? That would make sense when we set the padding side to left. What am I missing?

kalpanmukherjee · June 15, 2024, 6:23pm

@DoganK01 from what I understand, what happens is that the model sees:
[eos] - nothing to generate
[eos] [eos] - nothing to generate
[eos] [eos] hello - generates logits for after hello

hope this clears it up for you!

Boltzmachine · September 10, 2024, 4:52pm

I cannot understand why Hugging Face implemented it like this. Why don't they extract the logits at the last non-pad token of each sample?
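For what it's worth, here is a small sketch of what "extract the last non-pad token" could look like with right padding, and of one complication that follows. This is one plausible reading, not something confirmed in this thread. The `last_real_index` helper and the continuation tokens are hypothetical.

```python
PAD = "<pad>"

def last_real_index(mask_row):
    """Index of the last attended position; assumes a right-padded row (1s, then 0s)."""
    return sum(mask_row) - 1

tokens = [
    ["1", "2", "3", PAD, PAD],
    ["A", "B", "C", "D", "E"],
]
mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]

# Step 1: finding the last real position of each row is straightforward.
read_positions = [last_real_index(row) for row in mask]
print(read_positions)  # [2, 4]

# Step 2: generated tokens are appended as a new column at the end of the batch,
# so the shorter row ends up with padding between its prompt and its continuation.
new_tokens = ["4", "F"]  # made-up continuations
extended = [row + [tok] for row, tok in zip(tokens, new_tokens)]
print(extended[0])  # ['1', '2', '3', '<pad>', '<pad>', '4']
```

The first step is easy, but every later step would have to deal with pad tokens sitting in the middle of a row. Left padding avoids that by keeping the real tokens contiguous at the end.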

rlee002 · January 7, 2025, 2:45am

Adding on to this: I believe this only applies to the generation (inference) side of the model. So when fine-tuning an LLM, do we still keep right padding, or do we follow the same logic as for inference and use left padding?

MauroExtrac · April 17, 2025, 3:55pm

Did you ever find out?

DoganK01 · May 27, 2025, 12:35pm

Guys, I figured it out. Since the models are decoder-only (autoregressive), it makes no sense to apply padding on the right side, because the model predicts the next token by looking at the last position, as you can see from @zhouzaida's last answer in this thread. As for the model stopping when it sees EOS: the code handles this by telling the model not to attend to the padding (EOS) tokens at the beginning, so it skips them. This is what I've figured out. But if we tell the model to skip those padding tokens, it shouldn't matter whether the pad token is set to EOS or BOS. I don't have an answer for that last part.
