Padding side in instruction fine-tuning using SFTTrainer

I am a little confused about which padding side should be used when instruction fine-tuning large language models.

The Supervised Fine-tuning Trainer (SFTTrainer) comes from the trl library. Most tutorials on instruction fine-tuning (e.g. the tutorial by Philipp Schmid, Technical Lead at HF: LINK) include a line of code that changes the padding side:

tokenizer.padding_side = 'right'
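
For context, here is a minimal sketch of the kind of setup I mean (model name and dataset are placeholders, and the exact SFTTrainer argument names differ between trl versions):

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    model_name = "some-decoder-only-model"  # placeholder checkpoint
    dataset = load_dataset("some-instruction-dataset", split="train")  # placeholder dataset

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # many decoder-only models define no pad token
    tokenizer.padding_side = "right"           # the line the tutorials add

    model = AutoModelForCausalLM.from_pretrained(model_name)

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        tokenizer=tokenizer,        # older trl API; newer releases take processing_class
        dataset_text_field="text",  # assuming the dataset has a plain "text" column
        max_seq_length=512,
    )
    trainer.train()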

When this is done, we get a warning message from the transformers library:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Let’s assume I am a polite student who listens to the warnings and changes the padding side to left, but then another warning is thrown, this time by the trl library:

UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.

This is super confusing to me because you basically cannot fine-tune without triggering a warning; there is always something telling you to initialize the tokenizer in a different way.

So my question is: what is the desired padding side for instruction fine-tuning, and what are the use cases of left and right padding? I found multiple discussions on the Internet, but their conclusions are contradictory (some say right padding, others left padding). Even the Hugging Face documentation tells you to use left padding (for generation), so why did a person from HF use right padding in his tutorial?
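
For reference, this is my understanding of why generation wants left padding (toy sketch, model name is a placeholder): with right padding the model would have to continue generating after pad tokens, whereas with left padding the last token of every sequence in the batch is a real prompt token, so generation picks up directly from the prompt.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "some-decoder-only-model"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # pad on the left so every prompt ends with a real token

    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompts = ["Translate to French: cheese", "Hi"]  # different lengths, so padding is needed
    batch = tokenizer(prompts, padding=True, return_tensors="pt")
    out = model.generate(**batch, max_new_tokens=20)
    print(tokenizer.batch_decode(out, skip_special_tokens=True))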
