Padding side in instruction fine-tuning using SFTTrainer

I am a little confused about which padding side should be used when instruction fine-tuning large language models.

The Supervised Fine-tuning Trainer (SFTTrainer) comes from the trl library. Most tutorials on instruction fine-tuning (e.g. the tutorial by Philipp Schmid, Technical Lead at HF: LINK) include a line of code that changes the padding side:

tokenizer.padding_side = 'right'
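
For context, here is a minimal sketch of the kind of setup I mean (model name and dataset are placeholders, and the exact SFTTrainer argument names differ between trl versions):

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    model_name = "some-decoder-only-model"  # placeholder checkpoint
    dataset = load_dataset("some-instruction-dataset", split="train")  # placeholder dataset

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # many decoder-only models define no pad token
    tokenizer.padding_side = "right"           # the line the tutorials add

    model = AutoModelForCausalLM.from_pretrained(model_name)

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        tokenizer=tokenizer,        # older trl API; newer releases take processing_class
        dataset_text_field="text",  # assuming the dataset has a plain "text" column
        max_seq_length=512,
    )
    trainer.train()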

When this is done, we get a warning message from the transformers library:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Let’s assume I am a polite student who listens to the warnings and changes the padding side to left, but then another warning is thrown, this time by the trl library:

UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.

This is super confusing to me because you basically cannot fine-tune without triggering a warning; there is always something telling you to initialize the tokenizer in a different way.

So my question is: what is the desired padding side for instruction fine-tuning, and what are the use cases of left and right padding? I found multiple discussions on the Internet, but their conclusions are contradictory (some say right padding, others left padding). Even the Hugging Face documentation tells you to use left padding (for generation), so why did a person from HF use right padding in his tutorial?
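
For reference, this is my understanding of why generation wants left padding (toy sketch, model name is a placeholder): with right padding the model would have to continue generating after pad tokens, whereas with left padding the last token of every sequence in the batch is a real prompt token, so generation picks up directly from the prompt.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "some-decoder-only-model"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # pad on the left so every prompt ends with a real token

    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompts = ["Translate to French: cheese", "Hi"]  # different lengths, so padding is needed
    batch = tokenizer(prompts, padding=True, return_tensors="pt")
    out = model.generate(**batch, max_new_tokens=20)
    print(tokenizer.batch_decode(out, skip_special_tokens=True))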
