Best practices for training LLMs on long sequences?

Hello,

I am trying to fine-tune gemma-2-2b on long sequences (4000-8000 tokens).
Even with bf16 and 4-bit QLoRA, it still doesn't fit on a single 24GB GPU.
I want to scale this up to a multi-GPU instance, but that means I have to split the model across several GPUs.
What are the best practices for training LLMs on long sequences?
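For reference, here is a minimal sketch of roughly what I'm running. The exact LoRA rank, target modules, and quantization options are illustrative placeholders, not my full training script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "google/gemma-2-2b"

# 4-bit QLoRA quantization with bf16 compute, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Gradient checkpointing trades compute for activation memory,
# which is the dominant cost at 4k-8k token sequence lengths
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()

# LoRA adapter config (rank/alpha/target modules are assumptions)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```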

Thank you
