Why isn't quantization config reducing memory usage?

I’m training a model on 2x RTX 3090s, using Accelerate to handle the multi-GPU setup.
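
For reference, I’m launching with the Accelerate CLI, something like this (the script name is a placeholder):

accelerate launch --multi_gpu --num_processes=2 train.py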

I’ve tried a DistilBERT model and am now trying a Longformer model for its longer input sequence length. However, I’m running into memory issues.

import torch
from transformers import BitsAndBytesConfig, DistilBertForSequenceClassification
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit config (commented out while testing 8-bit)
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=False,
# )

# 8-bit config
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
)

model = DistilBertForSequenceClassification.from_pretrained(
    base_model_name, # load the base model and apply custom embedding layer
    # quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)

model.config.use_cache = False

# peft_config = LoraConfig(
#     task_type="SEQ_CLS",  # sequence classification
#     lora_alpha=16,
#     lora_dropout=0.4,
#     r=8,
#     bias="none",
#     target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
# )

# Note: prepare_model_for_kbit_training should wrap the quantized base
# model *before* get_peft_model is applied.
# model = prepare_model_for_kbit_training(model)
# model = get_peft_model(model, peft_config)

I’ve only just installed the second GPU. I was expecting the memory usage to spread across both cards (and therefore roughly halve per GPU), but I’m seeing strange numbers. I’m using the Accelerate package, which I believe uses DDP to distribute training across multiple GPUs.
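
As a sanity check on those per-GPU numbers, here is a minimal sketch of how per-process usage can be read under Accelerate; each DDP process is pinned to one device, and the torch.cuda calls are standard PyTorch:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# Under DDP each process drives exactly one GPU, so these report
# this process's usage on its own device.
alloc_gib = torch.cuda.memory_allocated() / 2**30
reserved_gib = torch.cuda.memory_reserved() / 2**30
print(f"rank {accelerator.process_index}: "
      f"allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

(Note that nvidia-smi reports more than memory_allocated, since it also counts the CUDA context and the caching allocator's reserved-but-unused memory.)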

Here are the results of some tests using DistilBERT and a batch size of 16 (memory in GB):

| Single/Multi GPU | LoRA + k-bit | bitsandbytes | GPU 1 Mem (GB) | GPU 2 Mem (GB) |
|---|---|---|---|---|
| Single GPU | off | off | 2.6 | 0.7 |
| Single GPU | off | 4-bit | 2.6 | 0.8 |
| Single GPU | off | 8-bit | 2.7 | 0.8 |
| Single GPU | on | off | 3.8 | 0.8 |
| Single GPU | on | 4-bit | 3.7 | 0.8 |
| Single GPU | on | 8-bit | 3.7 | 0.8 |
| Accelerate | off | off | 7.0 | 8.4 |
| Accelerate | off | 4-bit | 10.6 | 10.7 |
| Accelerate | off | 8-bit | 10.9 | 11.3 |
| Accelerate | on | off | 6.6 | 7.7 |
| Accelerate | on | 4-bit | 4.1 | 4.7 |
| Accelerate | on | 8-bit | 4.3 | 5.0 |

Many questions…

  1. Why does using two GPUs increase the memory usage so much? Each GPU is using double (or more) the single-GPU amount, where I’d expect each to use around half.
  2. Why does LoRA increase memory usage on a single GPU from ~2.7 GB to ~3.7 GB? (See the sketch after this list.)
  3. Why does bitsandbytes quantization increase memory when used without LoRA on two GPUs?
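
On question 2 specifically, here is the quick check mentioned above for seeing how small the LoRA adapter itself is (a sketch only: "distilbert-base-uncased" and num_labels=2 stand in for my actual base model and labels; print_trainable_parameters is PEFT’s built-in helper):

from peft import LoraConfig, get_peft_model
from transformers import DistilBertForSequenceClassification

# Load a stand-in base model and wrap it with the same LoRA config as above.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    lora_alpha=16,
    lora_dropout=0.4,
    r=8,
    bias="none",
    target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts

The adapter weights come out to a tiny fraction of the base model, which is why the ~1 GB jump surprises me.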

I’m using the same data to train all models; the only difference between runs is which lines are commented/uncommented in the code above.