Why isn't quantization config reducing memory usage?

I’m training a model on 2x RTX 3090s, using Accelerate to handle the multi-GPU setup.
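
For reference, I’m launching with the Accelerate CLI, something like this (the script name is a placeholder):

accelerate launch --multi_gpu --num_processes=2 train.py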

I’ve tried a DistilBERT model and am now trying a Longformer model for its longer input sequence length. However, I’m running into memory issues.

import torch
from transformers import BitsAndBytesConfig, DistilBertForSequenceClassification
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit config (commented out while testing 8-bit)
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=False,
# )

# 8-bit config
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
)

model = DistilBertForSequenceClassification.from_pretrained(
    base_model_name, # load the base model and apply custom embedding layer
    # quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)

model.config.use_cache = False

# peft_config = LoraConfig(
#     task_type="SEQ_CLS",  # sequence classification
#     lora_alpha=16,
#     lora_dropout=0.4,
#     r=8,
#     bias="none",
#     target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
# )

# Note: prepare_model_for_kbit_training should wrap the quantized base
# model *before* get_peft_model is applied.
# model = prepare_model_for_kbit_training(model)
# model = get_peft_model(model, peft_config)

I’ve only just installed the second GPU. I was expecting the memory usage to spread across both cards (and therefore roughly halve per GPU), but I’m seeing strange numbers. I’m using the Accelerate package, which I believe uses DDP to distribute training across multiple GPUs.
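
As a sanity check on those per-GPU numbers, here is a minimal sketch of how per-process usage can be read under Accelerate; each DDP process is pinned to one device, and the torch.cuda calls are standard PyTorch:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# Under DDP each process drives exactly one GPU, so these report
# this process's usage on its own device.
alloc_gib = torch.cuda.memory_allocated() / 2**30
reserved_gib = torch.cuda.memory_reserved() / 2**30
print(f"rank {accelerator.process_index}: "
      f"allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

(Note that nvidia-smi reports more than memory_allocated, since it also counts the CUDA context and the caching allocator's reserved-but-unused memory.)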

Here are the results of some tests using DistilBERT and a batch size of 16 (memory in GB):

| Single/Multi GPU | LoRA + k-bit | bitsandbytes | GPU 1 Mem (GB) | GPU 2 Mem (GB) |
|---|---|---|---|---|
| Single GPU | off | off | 2.6 | 0.7 |
| Single GPU | off | 4-bit | 2.6 | 0.8 |
| Single GPU | off | 8-bit | 2.7 | 0.8 |
| Single GPU | on | off | 3.8 | 0.8 |
| Single GPU | on | 4-bit | 3.7 | 0.8 |
| Single GPU | on | 8-bit | 3.7 | 0.8 |
| Accelerate | off | off | 7.0 | 8.4 |
| Accelerate | off | 4-bit | 10.6 | 10.7 |
| Accelerate | off | 8-bit | 10.9 | 11.3 |
| Accelerate | on | off | 6.6 | 7.7 |
| Accelerate | on | 4-bit | 4.1 | 4.7 |
| Accelerate | on | 8-bit | 4.3 | 5.0 |

Many questions…

  1. Why does using two GPUs increase the memory usage so much? Each GPU is using double (or more) the single-GPU amount, where I’d expect each to use around half.
  2. Why does LoRA increase memory usage on a single GPU from ~2.7 GB to ~3.7 GB? (See the sketch after this list.)
  3. Why does bitsandbytes quantization increase memory when used without LoRA on two GPUs?
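
On question 2 specifically, here is the quick check mentioned above for seeing how small the LoRA adapter itself is (a sketch only: "distilbert-base-uncased" and num_labels=2 stand in for my actual base model and labels; print_trainable_parameters is PEFT’s built-in helper):

from peft import LoraConfig, get_peft_model
from transformers import DistilBertForSequenceClassification

# Load a stand-in base model and wrap it with the same LoRA config as above.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    lora_alpha=16,
    lora_dropout=0.4,
    r=8,
    bias="none",
    target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts

The adapter weights come out to a tiny fraction of the base model, which is why the ~1 GB jump surprises me.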

I’m using the same data to train all models; the only difference between runs is which lines are commented/uncommented in the code above.