Significance of block size

Hi!

Reducing the block size seems to be an effective way to fit models in memory. However, I am wondering what consequences this might have for the trained model. Especially if the model is fine-tuned on a smaller block size than used initially for pre-training.


Ping! Anyone? :blush:

I’m sorry I can’t answer your question, but could you please share how you change the block size of a HuggingFace model?

From what I understand, for AutoTrain with SFT fine-tuning for example, the block size sets the maximum attention window size for the input. As such, GPU RAM usage will increase with this setting, similar to inference, but significantly more so because gradients must also be computed. The implication is that if you want to fine-tune for an instruct task and your (Instruction + Context + Answer) exceeds the block size, the sequence gets split up and the model might learn the Instruction, Context, and Answer independently. Probably not what you’re aiming for. You can also reduce GPU RAM usage by decreasing the batch size, using gradient accumulation steps to preserve the effective batch size.
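To illustrate what that splitting looks like, here is a minimal sketch of the common "group texts into fixed blocks" pattern used in causal-LM fine-tuning pipelines. The function name and the toy token IDs are my own placeholders, not AutoTrain internals, and real preprocessing varies by library:

```python
def group_into_blocks(token_ids, block_size):
    """Split concatenated token ids into fixed-size blocks;
    any trailing remainder shorter than block_size is dropped."""
    total = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, total, block_size)]

# Toy example: a 10-token (Instruction + Context + Answer) sequence
# with block_size=4 is cut into two blocks, and the final 2 tokens
# are discarded -- so the Answer can end up severed from its Instruction.
example = list(range(10))
print(group_into_blocks(example, block_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

This is why a block size smaller than your typical training example can hurt: the model never attends across the block boundary, so the pieces are effectively trained as separate samples.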