VRAM Usage Differences in SageMaker Training Jobs vs. Direct Instances for Fine-Tuning Llama 3 8B with QLoRA

Hi, everyone.

I’m currently trying to fine-tune the meta-llama/Meta-Llama-3-8B-Instruct model using QLoRA. I successfully completed the fine-tuning with an Amazon SageMaker Training Job on an ml.g5.8xlarge instance.

However, when I ran the same code on an ml.p4d.24xlarge instance outside of a Training Job, I hit a CUDA out of memory error and could not complete the fine-tuning. Is it possible that SageMaker Training Jobs do something that significantly reduces VRAM usage? Additionally, roughly how much VRAM is required to fine-tune the meta-llama/Meta-Llama-3-8B-Instruct model using QLoRA with 4-bit or 8-bit quantization?
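For context, here is my own rough back-of-envelope estimate (the constants are my assumptions, so please correct me if they are off); it covers only the quantized weights, the LoRA adapters, and their optimizer state, not activations or framework overhead:

# Rough VRAM estimate for QLoRA on an ~8B-parameter model.
# All numbers are approximations; activations, the CUDA context, and
# quantization overhead (e.g. dequantization buffers) are NOT included.

base_params = 8.0e9   # ~8B weights in the base model
lora_params = 4.2e7   # ~42M trainable LoRA params (r=16, 7 target modules) -- my own rough count

weights_8bit_gb = base_params * 1.0 / 1e9   # int8 weights: ~8 GB
weights_4bit_gb = base_params * 0.5 / 1e9   # nf4 weights:  ~4 GB

# LoRA adapters train in higher precision; with AdamW you pay for the fp32
# parameter, its gradient, and two optimizer moments (~16 bytes per param).
lora_train_gb = lora_params * 16 / 1e9      # ~0.7 GB

print(f"8-bit base + LoRA training state: ~{weights_8bit_gb + lora_train_gb:.1f} GB")
print(f"4-bit base + LoRA training state: ~{weights_4bit_gb + lora_train_gb:.1f} GB")
# Activations come on top of this and scale with batch size and sequence length.

Even with generous headroom for activations, I don't see how this would exceed the per-GPU memory of a p4d.24xlarge (eight 40 GB A100s) when the job fit on the single 24 GB A10G of the g5.8xlarge, so I assume I'm misunderstanding something.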
I would greatly appreciate any help you can provide.

< model & LoRA config >


from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# Load the base model with 8-bit (bitsandbytes) weights and let Accelerate
# place it across the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    use_cache=False if args.gradient_checkpointing else True,  # KV cache is incompatible with gradient checkpointing
    device_map="auto",
    load_in_8bit=True,
)
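
For the 4-bit case I mention above, I assume the load would look roughly like this (a sketch using BitsAndBytesConfig; the specific bnb_4bit_* settings are just my assumptions, not something I have actually run):

# Hypothetical 4-bit (QLoRA-style) load for comparison -- not what I ran above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in bf16
    bnb_4bit_use_double_quant=True,         # double quantization to save a bit more memory
)

model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    quantization_config=bnb_config,
    device_map="auto",
    use_cache=False if args.gradient_checkpointing else True,
)

The LoRA config below would stay the same in either case.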

# LoRA adapters on all attention and MLP projection matrices.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

# Prepare the quantized model for training, then attach the LoRA adapters.
# (Newer PEFT releases rename prepare_model_for_int8_training to prepare_model_for_kbit_training.)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
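
To compare the two environments, I'm planning to log per-GPU memory right after model loading and again during the first training step, with a small helper like this (just a sketch):

import torch

def log_gpu_memory(tag: str) -> None:
    # Print allocated / reserved / total memory for every visible GPU.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        total = torch.cuda.get_device_properties(i).total_memory / 1024**3
        print(f"[{tag}] cuda:{i} allocated={allocated:.1f} GiB "
              f"reserved={reserved:.1f} GiB total={total:.1f} GiB")

log_gpu_memory("after model load")

I'll run it with the same batch size and sequence length in both setups to see where the numbers diverge.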