Which multi-GPU strategy should I use (and how) to train a model with a longer max_length? (Phi-2 fits on a single GPU, but QLoRA gives OOM at 512 on multiple GPUs)

By strategy I mean DDP, tensor parallelism, model parallelism, pipeline parallelism, etc., and, more importantly, how to use that strategy with the HF Trainer to increase max_length.

I’m trying to train Phi-2, whose memory footprint is about 1.7 GB in 4-bit. I loaded the model with a 4-bit config and used paged_adamw_8bit with gradient checkpointing. I have 8x A10 GPUs with 24 GB each, but when I try to train the model it fails to even reach a sequence length of 512. I’m using the Hugging Face Trainer. What can be done?

With a single GPU, I can run the code below with a batch size of 2 at a length of 2048, with a peak GPU usage of 19624 MiB, but with multiple GPUs it breaks at a length of 512 and a batch size of 1.

When I try to load it with device_map = "auto", the Trainer throws an error saying it can't train when the model is in 8-bit on another device.

Without that, nvidia-smi shows GPU 0 using 22524 MiB while the other six GPUs sit at around 4384 MiB each. I think the model is not being loaded/distributed properly. Could someone please help?
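
For reference, one workaround I have seen suggested (but have not verified myself) is to skip device_map="auto" and instead pin one full copy of the quantized model to each process's GPU, so the Trainer can run plain DDP on top. A rough sketch, reusing the same model_name and bnb_config as in my code below:

from accelerate import PartialState

# Unverified sketch: give every DDP process its own full copy of the 4-bit model
# on its local GPU, instead of sharding layers across GPUs with device_map="auto".
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map={"": PartialState().local_process_index},  # pin to this rank's GPU
)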

Here is my code:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False, # needed for now, should be fixed soon
)
tokenizer.pad_token = tokenizer.eos_token


bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type='nf4',
                                bnb_4bit_compute_dtype=torch.bfloat16,
                                # bnb_4bit_compute_dtype="float16",
                                bnb_4bit_use_double_quant=True)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, quantization_config=bnb_config,)


model.gradient_checkpointing_enable()  # gradient checkpointing to save memory

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)  # Freeze base model layers and cast layernorms to fp32 (this also enables gradient checkpointing, so the call above is redundant but harmless)

lora_config = LoraConfig(
    r=256,
    lora_alpha=512,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'dense', 'fc1', 'fc2'],  # print(model) will show the modules to use
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)


model = get_peft_model(model, lora_config)  # wrap the base model with LoRA adapters
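
As an optional sanity check on the adapter setup, the trainable parameter count can be printed here; with r=256 the LoRA layers themselves add a fairly large number of trainable parameters:

model.print_trainable_parameters()  # PEFT helper: reports trainable vs. total parameters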

Here’s the training code:

training_args = TrainingArguments(
    output_dir='./results',  # Output directory for checkpoints and predictions
    overwrite_output_dir=True, # Overwrite the content of the output directory
    per_device_train_batch_size=1,  # Batch size for training
    per_device_eval_batch_size=1,  # Batch size for evaluation
    gradient_accumulation_steps=1, # number of steps before optimizing
    gradient_checkpointing=True,   # Enable gradient checkpointing
    gradient_checkpointing_kwargs={"use_reentrant": False},
    warmup_steps=10,  # Number of warmup steps
    max_steps=5000,  # Total number of training steps (when set, this overrides num_train_epochs)
    num_train_epochs=3,  # Number of training epochs
    learning_rate=5e-5,  # Learning rate
    weight_decay=0.01,  # Weight decay
    optim="paged_adamw_8bit", #Keep the optimizer state and quantize it
    bf16=True, #Use mixed precision training
    
    #For logging and saving
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,  # Limit the total number of checkpoints
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True, # Load the best model at the end of training
    report_to='wandb',
    neftune_noise_alpha=5,
)

trainer = Trainer(
    model = model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
)

# Disable cache to prevent a warning; re-enable it for inference
model.config.use_cache = False
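
After setting that, I simply start training; the OOM on multiple GPUs happens during this call once sequences approach 512 tokens:

trainer.train()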