Fine-tuning Llama-2-70B with 4-bit quantization on multiple GPUs using DeepSpeed ZeRO


I am trying to fine-tune a 4-bit quantized Llama-2-70B on multiple GPUs (3x A100 40 GB) using DeepSpeed ZeRO-3.

At the moment, I can fine-tune the 4-bit quantized model on the 3 GPUs using SFTTrainer with naive model parallelism (basically just device_map: auto).
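For context, the working setup looks roughly like the sketch below. It is not my exact script: the model ID, the bfloat16 compute dtype, and the helper name load_4bit_model are assumptions for illustration, and it needs torch, transformers, accelerate, and bitsandbytes installed.

```python
def load_4bit_model(model_id="meta-llama/Llama-2-70b-hf"):
    """Load a causal LM in 4-bit and spread its layers across all visible GPUs.

    Hypothetical helper; the default model_id is an assumption.
    """
    # Imported lazily so this file can be imported on a machine
    # without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit weights via bitsandbytes; matmuls are computed in bfloat16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # device_map="auto" lets accelerate place different layers on
    # different GPUs (naive model parallelism, not ZeRO sharding).
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
```

The returned model is then passed to SFTTrainer as usual.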

Following the DeepSpeed Integration docs, my understanding is that adding a DeepSpeed config and launching the script with deepspeed should have done the trick, but it doesn’t.

It still tries to load the model with model parallelism (different layers on different GPUs), and it also ignores the quantization config and tries to load the full, unquantized model (which fails).
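A rough back-of-the-envelope check shows why the unquantized load cannot fit on this hardware. The sketch assumes about 70e9 parameters, 2 bytes per parameter in fp16/bf16 and 0.5 bytes in 4-bit, and it ignores quantization metadata, activations, gradients, and optimizer state, so the real requirements are higher.

```python
# Rough weight-only memory estimate; every figure here is an approximation.
N_PARAMS = 70e9    # assumed parameter count for Llama-2-70B
GPU_MEM_GB = 40    # nominal memory of one A100 40 GB
N_GPUS = 3


def weights_gb(n_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


fp16_gb = weights_gb(N_PARAMS, 2.0)   # 16-bit weights
int4_gb = weights_gb(N_PARAMS, 0.5)   # 4-bit weights
total_gb = GPU_MEM_GB * N_GPUS        # memory across all GPUs

print(f"16-bit weights: {fp16_gb:.0f} GB, fits in {total_gb} GB: {fp16_gb < total_gb}")
print(f"4-bit weights:  {int4_gb:.0f} GB, fits in {total_gb} GB: {int4_gb < total_gb}")
```

Under these assumptions the 16-bit weights alone exceed the combined memory of the three GPUs, while the 4-bit weights leave room for everything else.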

Can you please help me by directing me to some working examples or maybe something similar?

Thanks a lot in advance

DeepSpeed ZeRO-3 doesn’t work with 4-bit quantization yet. DeepSpeed recently announced support for quantization down to 8-bit, which requires adding a new ‘quantization’ section to your DeepSpeed JSON config file.
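For reference, a plain ZeRO-3 config for the Hugging Face integration (without any quantization) could be generated roughly as below. This is a minimal sketch, not a tuned setup: the "auto" values rely on the Trainer filling them in from its own arguments, the function name build_zero3_config is hypothetical, and the ‘quantization’ section is deliberately left out because its keys are not spelled out here.

```python
import json


def build_zero3_config():
    """Return a minimal DeepSpeed ZeRO-3 config as a plain dict."""
    return {
        # bf16 mixed precision; "auto" defers to the Trainer arguments.
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            # Stage 3 shards parameters, gradients, and optimizer state.
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
            # Gather full 16-bit weights when saving a checkpoint.
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        # Batch settings are taken from the Trainer when set to "auto".
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


config_json = json.dumps(build_zero3_config(), indent=2)
print(config_json)
```

The printed JSON can be saved to a file and passed to the training script through the Trainer's deepspeed argument, with the script itself launched using the deepspeed launcher.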