I’m currently attempting to fine-tune a Llama2 7B LoRA (not QLoRA; I’m loading the model in fp16) using PEFT / Transformers with DeepSpeed. I have to use CPU offload plus a small batch size, so training is extremely slow. GPU utilization oscillates between 0% and 100%, so the offloading seems to be eating up a lot of cycles.
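For context, the offload setup I mean is along these lines (a sketch, not my exact config; the batch/accumulation values are illustrative, assuming ZeRO stage 2 with optimizer offload):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```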
My goal is to understand what is realistically possible with A6000s when it comes to fine-tuning. What can you train in a reasonable timeframe with 2x A6000s (48 GB each)? Is it possible to fine-tune a 4k / 8k / 32k context-window LoRA without quantizing in a reasonable amount of time? If so, any tutorials / examples / explainers would be really helpful. If not, what is realistically possible with these cards?
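For reference, here's my rough back-of-envelope VRAM math (a sketch; it assumes r=16 LoRA on the four attention projections and AdamW, which are illustrative choices, not a measured config). It suggests the static memory isn't the bottleneck; it's activation memory, which grows with context length and batch size:

```python
# Back-of-envelope static VRAM estimate for LoRA fine-tuning Llama2 7B in fp16.
# Assumptions: r=16 LoRA on q/k/v/o projections, AdamW optimizer.
# Activation memory (the part that scales with context length) is NOT counted here.

BYTES_FP16 = 2
BYTES_FP32 = 4

base_params = 7e9                             # Llama2 7B parameter count
weights_gb = base_params * BYTES_FP16 / 1e9   # ~14 GB of frozen fp16 weights

# Hypothetical LoRA size: 32 layers x 4 projections x 2 matrices (A and B),
# each of shape (4096 x 16) -> trainable parameter count
lora_params = 32 * 4 * 2 * (4096 * 16)

# Only trainable params need gradients (fp16) plus AdamW states (2 x fp32 each)
trainable_gb = lora_params * (BYTES_FP16 + BYTES_FP16 + 2 * BYTES_FP32) / 1e9

print(f"frozen weights: {weights_gb:.1f} GB")
print(f"LoRA params: {lora_params / 1e6:.1f} M -> grads + optimizer: {trainable_gb:.2f} GB")
```

If that math is right, the adapters and optimizer states are tiny next to the frozen weights, which is why I'm confused that I still need CPU offload at longer contexts.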