I’m training a model on 2x RTX 3090s, using Accelerate to handle the multi-GPU setup.
I started with a DistilBERT model and am now trying a Longformer model for its longer input sequence length. However, I’m running into memory issues.
```python
import torch
from transformers import BitsAndBytesConfig, DistilBertForSequenceClassification
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit config (commented out while testing 8-bit)
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=False,
# )

# 8-bit config
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
)

model = DistilBertForSequenceClassification.from_pretrained(
    base_model_name,  # load the base model and apply custom embedding layer
    # quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
model.config.use_cache = False

# LoRA setup (commented out for the non-LoRA runs); when enabled,
# prepare_model_for_kbit_training should run before get_peft_model
# peft_config = LoraConfig(
#     task_type="SEQ_CLS",  # sequence classification
#     lora_alpha=16,
#     lora_dropout=0.4,
#     r=8,
#     bias="none",
#     target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
# )
# model = prepare_model_for_kbit_training(model)
# model = get_peft_model(model, peft_config)
```
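For context, the multi-GPU side goes through Accelerate’s standard prepare/backward flow. A minimal sketch of that part (the `optimizer` and `train_dataloader` names here are placeholders, not my exact training loop):

```python
from accelerate import Accelerator

accelerator = Accelerator()  # script is started with `accelerate launch`

# Accelerate wraps the model in DDP and shards the dataloader across processes
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

for batch in train_dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # syncs gradients across GPUs
    optimizer.step()
    optimizer.zero_grad()
```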
I’ve only just installed the 2nd GPU. I was expecting memory usage to be spread across the two GPUs (and therefore roughly halved on each), but I’m seeing strange numbers instead. I’m using the Accelerate package, which I believe uses DDP (DistributedDataParallel) to distribute the work across multiple GPUs.
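For reference, the per-GPU numbers below are the kind reported by `nvidia-smi`; the same snapshot can be taken from inside the script (a sketch using only standard `torch.cuda` calls):

```python
import torch

# Rough per-GPU memory snapshot. "reserved" is what the caching
# allocator holds and is closer to what nvidia-smi reports.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: {allocated:.1f} GB allocated, {reserved:.1f} GB reserved")
```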
Here are the results of some tests using DistilBERT and a batch size of 16:
Single/Multi GPU | LoRA + k-bit | bitsandbytes | GPU 1 Mem (GB) | GPU 2 Mem (GB) |
---|---|---|---|---|
Single GPU | off | off | 2.6 | 0.7 |
Single GPU | off | 4bit | 2.6 | 0.8 |
Single GPU | off | 8bit | 2.7 | 0.8 |
Single GPU | on | off | 3.8 | 0.8 |
Single GPU | on | 4bit | 3.7 | 0.8 |
Single GPU | on | 8bit | 3.7 | 0.8 |
Accelerate | off | off | 7 | 8.4 |
Accelerate | off | 4bit | 10.6 | 10.7 |
Accelerate | off | 8bit | 10.9 | 11.3 |
Accelerate | on | off | 6.6 | 7.7 |
Accelerate | on | 4bit | 4.1 | 4.7 |
Accelerate | on | 8bit | 4.3 | 5 |
Many questions…
- Why does using 2 GPUs increase memory usage so much? Each GPU is using double (or more) what the single-GPU run used, where I’d expect each to use around half.
- Why does LoRA increase memory usage on 1 GPU from ~2.7 GB to ~3.7 GB?
- Why does bitsandbytes quantization increase memory when used without LoRA on 2 GPUs?
I’m using the same data to train all models; the only difference between runs is which lines in the code above are commented or uncommented.