I am trying to fine-tune a 4-bit quantized Llama-2-70B on multiple GPUs (3x A100 40GB) using DeepSpeed ZeRO-3.
At the moment, I am able to fine-tune the 4-bit quantized model across the 3 GPUs using SFTTrainer with naive model parallelism (basically just `device_map: auto`).
Following the DeepSpeed integration guide, my understanding is that adding a DeepSpeed config and launching the script with `deepspeed` should do the trick, but it doesn't.
The script still tries to load the model with model parallelism (different layers on different GPUs), and it also ignores the quantization config and tries to load the full-precision model (which fails).
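For reference, the ZeRO-3 config I am passing looks roughly like this (a minimal sketch; the exact values in my run differ, and the `auto` entries are meant to be filled in from the Trainer arguments):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```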
Could you please point me to some working examples, or anything similar?
Thanks a lot in advance.