Significance of block size
Reducing the block size seems to be an effective way to fit models in memory. However, I am wondering what consequences this has for the trained model, especially when the model is fine-tuned with a smaller block size than was used for pre-training.
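For context, by "block size" I mean the fixed sequence length that the tokenized corpus is chunked into before training, roughly like this (a minimal sketch; `group_texts` is a hypothetical helper, not the exact code from any particular training script):

```python
def group_texts(token_ids, block_size):
    """Chunk a flat list of token ids into fixed-length blocks,
    dropping the tail that does not fill a complete block."""
    total = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, total, block_size)]

# 10 tokens with block_size=4 -> two full blocks, last 2 tokens dropped.
blocks = group_texts(list(range(10)), 4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

So my question is whether training on these shorter chunks hurts the model's ability to use longer contexts it saw during pre-training.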

Ping! Anyone? :blush: