Fine-tuning Llama 2 70B with 4-bit quantization on multiple GPUs using DeepSpeed ZeRO

prakhar19 · October 5, 2023, 5:21pm

Hello,

I am trying to fine-tune a 4-bit quantized Llama 2 70B on multiple GPUs (3x A100 40GB) using DeepSpeed ZeRO-3.

At the moment, I can fine-tune the 4-bit quantized model on the 3 GPUs using SFTTrainer with naive model parallelism (basically just device_map="auto").
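A minimal sketch of the working setup described here, assuming the Hugging Face `transformers` and `bitsandbytes` stack. The model ID, the function name `load_4bit_model`, and the specific quantization settings (NF4, bfloat16 compute, double quantization) are assumptions for illustration, not details taken from this post:

```python
# Sketch only: loads a 4-bit quantized causal LM and lets Accelerate
# spread its layers across all visible GPUs (naive model parallelism).

MODEL_ID = "meta-llama/Llama-2-70b-hf"  # assumed checkpoint name


def load_4bit_model(model_id: str = MODEL_ID):
    """Return a 4-bit quantized model sharded layer-wise across GPUs."""
    # Imported lazily so this file can be imported without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize weights to 4 bits on load
        bnb_4bit_quant_type="nf4",              # assumed: NF4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed: bf16 for matmuls
        bnb_4bit_use_double_quant=True,         # assumed: also quantize the scales
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # place different layers on different GPUs
    )
```

With `device_map="auto"`, only one GPU is busy at a time during a forward pass, which is why this mode fits the model but does not speed up training.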

Following the DeepSpeed Integration docs, my understanding was that adding a DeepSpeed config and launching the script with deepspeed should have done the trick, but it doesn’t.
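For reference, a ZeRO-3 config of the kind the DeepSpeed integration expects could look roughly like the sketch below. The helper name `build_zero3_config` and the output filename are hypothetical, and the `"auto"` values assume the Hugging Face Trainer fills them in from its own arguments:

```python
import json


def build_zero3_config() -> dict:
    """Build a minimal DeepSpeed ZeRO stage 3 config as a plain dict."""
    return {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,                     # shard params, grads and optimizer state
            "overlap_comm": True,           # overlap communication with compute
            "contiguous_gradients": True,   # reduce memory fragmentation
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            # gather full 16-bit weights when saving a checkpoint
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


if __name__ == "__main__":
    # Hypothetical filename; any path works as long as the trainer gets it.
    with open("ds_config_zero3.json", "w") as f:
        json.dump(build_zero3_config(), f, indent=2)
```

With the Hugging Face Trainer, a file like this is typically passed as `TrainingArguments(deepspeed="ds_config_zero3.json")` and the script launched with `deepspeed --num_gpus=3 train.py` (the script name is hypothetical). As far as I know, ZeRO-3 in `transformers` is not meant to be combined with a `device_map`, so that argument would likely need to be dropped from `from_pretrained` in this mode.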

It still tries to load the model with model parallelism (different layers on different GPUs), and it also ignores the quantization config and tries to load the full, unquantized model, which fails.
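A rough back-of-the-envelope check shows why the unquantized load cannot fit. It assumes a round 70 billion parameters and counts weights only, ignoring activations, adapter and optimizer state, quantization metadata, and CUDA overhead:

```python
# Weight-only memory estimate; all figures in decimal GB.
N_PARAMS = 70e9        # assumed round parameter count
N_GPUS = 3
GPU_MEM_GB = 40        # per-GPU memory from the post (A100 40GB)


def weight_gb(bits_per_param: int) -> float:
    """Memory needed for the weights alone at the given precision."""
    return N_PARAMS * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(16)            # 16-bit weights
int4_gb = weight_gb(4)             # 4-bit quantized weights
total_gpu_gb = N_GPUS * GPU_MEM_GB

print(f"fp16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
print(f"GPU memory:    {total_gpu_gb} GB across {N_GPUS} GPUs")
print(f"fp16 fits: {fp16_gb <= total_gpu_gb}, 4-bit fits: {int4_gb <= total_gpu_gb}")
```

Under these assumptions the 16-bit weights alone exceed the combined memory of the three GPUs, while the 4-bit weights take about a quarter of that, so a load path that silently drops the quantization config is expected to run out of memory.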

Could you point me to a working example, or to something similar?

Thanks a lot in advance

asherisaac · March 19, 2024, 7:50am

DeepSpeed ZeRO-3 doesn’t work with 4-bit quantization yet. They recently announced support for quantization down to 8-bit, which requires adding a new ‘quantization’ section to your DeepSpeed JSON config file.
