VRAM Usage Differences in SageMaker Training Jobs vs. Direct Instances for Fine-Tuning Llama 3 8B with QLoRA

Hi, everyone.

I’m currently trying to fine-tune the meta-llama/Meta-Llama-3-8B-Instruct model using QLoRA. I successfully completed the fine-tuning with an Amazon SageMaker Training Job on an ml.g5.8xlarge instance.

However, when I ran the same code on an ml.p4d.24xlarge instance outside of a Training Job, I hit a CUDA out of memory error and could not complete the fine-tuning. Is it possible that SageMaker Training Jobs do something that significantly reduces VRAM usage? Additionally, roughly how much VRAM is required to fine-tune the meta-llama/Meta-Llama-3-8B-Instruct model using QLoRA with 4-bit or 8-bit quantization?
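For context, here is my own rough back-of-envelope estimate (the constants are my assumptions, so please correct me if they are off); it covers only the quantized weights, the LoRA adapters, and their optimizer state, not activations or framework overhead:

# Rough VRAM estimate for QLoRA on an ~8B-parameter model.
# All numbers are approximations; activations, the CUDA context, and
# quantization overhead (e.g. dequantization buffers) are NOT included.

base_params = 8.0e9   # ~8B weights in the base model
lora_params = 4.2e7   # ~42M trainable LoRA params (r=16, 7 target modules) -- my own rough count

weights_8bit_gb = base_params * 1.0 / 1e9   # int8 weights: ~8 GB
weights_4bit_gb = base_params * 0.5 / 1e9   # nf4 weights:  ~4 GB

# LoRA adapters train in higher precision; with AdamW you pay for the fp32
# parameter, its gradient, and two optimizer moments (~16 bytes per param).
lora_train_gb = lora_params * 16 / 1e9      # ~0.7 GB

print(f"8-bit base + LoRA training state: ~{weights_8bit_gb + lora_train_gb:.1f} GB")
print(f"4-bit base + LoRA training state: ~{weights_4bit_gb + lora_train_gb:.1f} GB")
# Activations come on top of this and scale with batch size and sequence length.

Even with generous headroom for activations, I don't see how this would exceed the per-GPU memory of a p4d.24xlarge (eight 40 GB A100s) when the job fit on the single 24 GB A10G of the g5.8xlarge, so I assume I'm misunderstanding something.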
I would greatly appreciate any help you can provide.

< model & LoRA config >


from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# Load the base model with 8-bit (bitsandbytes) weights and let Accelerate
# place it across the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    use_cache=False if args.gradient_checkpointing else True,  # KV cache is incompatible with gradient checkpointing
    device_map="auto",
    load_in_8bit=True,
)
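
For the 4-bit case I mention above, I assume the load would look roughly like this (a sketch using BitsAndBytesConfig; the specific bnb_4bit_* settings are just my assumptions, not something I have actually run):

# Hypothetical 4-bit (QLoRA-style) load for comparison -- not what I ran above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in bf16
    bnb_4bit_use_double_quant=True,         # double quantization to save a bit more memory
)

model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    quantization_config=bnb_config,
    device_map="auto",
    use_cache=False if args.gradient_checkpointing else True,
)

The LoRA config below would stay the same in either case.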

# LoRA adapters on all attention and MLP projection matrices.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

# Prepare the quantized model for training, then attach the LoRA adapters.
# (Newer PEFT releases rename prepare_model_for_int8_training to prepare_model_for_kbit_training.)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
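
To compare the two environments, I'm planning to log per-GPU memory right after model loading and again during the first training step, with a small helper like this (just a sketch):

import torch

def log_gpu_memory(tag: str) -> None:
    # Print allocated / reserved / total memory for every visible GPU.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        total = torch.cuda.get_device_properties(i).total_memory / 1024**3
        print(f"[{tag}] cuda:{i} allocated={allocated:.1f} GiB "
              f"reserved={reserved:.1f} GiB total={total:.1f} GiB")

log_gpu_memory("after model load")

I'll run it with the same batch size and sequence length in both setups to see where the numbers diverge.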