Fine-Tuning Llama 3.2 1B (Quantized): Memory Requirements

Hi All!
I’m trying to fine-tune a Llama 3.2 1B Instruct model that is quantized during loading, but for some reason the trainer errors out with:
“OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 6.00 GiB of which 3.48 GiB is free.”
I’m not sure why training such a relatively small model would require 32 GB of VRAM.
I’d really appreciate any help; code attached below:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", num_labels=23, quantization_config=quantization_config)
lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules = ["q_proj", "k_proj", "v_proj"]
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir='./results',          
    num_train_epochs=1,              
    per_device_train_batch_size=1,   
    per_device_eval_batch_size=16,   
    warmup_steps=500,                
    weight_decay=0.01,               
    logging_dir='./logs',            
    logging_steps=10,
    evaluation_strategy="epoch",     
    gradient_accumulation_steps=4,  
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()  # this call raises the CUDA OutOfMemoryError quoted above

With the given information, it’s hard to tell why so much memory is needed; there is nothing obviously wrong there. It would be helpful if you could share the full code and the full error message.

One common source of excessive memory usage is the data itself. If you have very long sequences, OOMs can easily occur because of attention’s quadratic memory requirement. You could check whether setting a low max_seq_length helps to curb the memory usage.
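
For illustration, a minimal sketch of capping sequence length at tokenization time (the original post doesn’t show the tokenization step, so the column name "text" and the 512-token cap are assumptions):

def tokenize(batch):
    # Truncate every example to at most 512 tokens so no single batch blows up activation memory.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = train_dataset.map(tokenize, batched=True)
eval_dataset = eval_dataset.map(tokenize, batched=True)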

Running LLaMA 3.2 locally requires adequate computational resources. Below are the recommended specifications:

Hardware:

GPU: NVIDIA GPU with CUDA support (16GB VRAM or higher recommended).
RAM: At least 32GB (64GB for larger models).
Storage: Minimum 50GB of free disk space for the model and dependencies.

Software:

Operating System: Linux (preferred), macOS, or Windows.
Python: Version 3.8 or higher.
CUDA Toolkit: Required for GPU acceleration (11.6 or newer).
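
If it helps, here is a small sketch (plain PyTorch, nothing Llama-specific) for checking what your local machine reports against these numbers:

import platform
import torch

print("Python:", platform.python_version())           # want 3.8 or higher
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)           # want 11.6 or newer
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("VRAM (GiB):", round(props.total_memory / 1024**3, 1))  # want 16 or more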

Has anyone fixed this yet? I’m using an AMD Ryzen 5 4600G with Radeon Graphics on my Ubuntu server.
I’ve got 64 GB of RAM, two ASUS GeForce RTX 4060 Ti 16 GB GPUs, and four 1 TB NVMe SSDs. I still keep getting those errors. I’ve tried configuring accelerate at the beginning of the script and setting every environment variable I can find, and CUDA still demands 32 GB.
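
For reference, hiding one of the two cards from the process is the kind of thing those environment variables boil down to; a generic sketch (the device index is arbitrary, and it has to run before anything CUDA-related is imported):

import os

# Expose only the first GPU to this process; set before torch/transformers are imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1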

As BenjaminB also mentioned, it seems unusual that this model size and these training parameters would require this much memory.
There may therefore be an issue with the library or the model, and the cause may lie in parts other than the visible parameters.

However, if this were a common issue, there would likely be many reports of it. It may be specific to certain conditions, such as a multi-GPU environment, a specific version of the library, or specific hardware.

I’m wondering if it maybe has something to do with using a local dataset instead of one from the Hub, but that’s just a desperate conclusion after many hours of struggling with this issue. I’ll check out the link you provided; hopefully that will provide some insight. Thanks.

I see, the dataset could also be a possible cause…
Well, the best practices for datasets are probably available in this forum or on GitHub if you search for them…:sweat_smile:

Also, depending on the model, gradient checkpointing may not be supported (I think it should be available for Llama 3.2 1B, though…), and there may still be some potential bugs in multi-GPU environments.
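
If it is supported, it is cheap to try. A sketch, assuming the same quantized model object as in the original post; prepare_model_for_kbit_training is peft’s helper for quantized models and is typically applied before get_peft_model:

from peft import prepare_model_for_kbit_training

# Enables gradient checkpointing by default for quantized models, recomputing
# activations in the backward pass instead of storing them (extra compute, less memory).
model = prepare_model_for_kbit_training(model)

# The same effect can also be requested through the trainer configuration:
# training_args = TrainingArguments(..., gradient_checkpointing=True)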

When trying to isolate the issue, it’s usually faster to temporarily switch to a smaller, simpler model or dataset.
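
For example (a hypothetical sketch; the tiny checkpoint name is just a commonly used debugging stand-in, and the 100-row slice is arbitrary):

# Swap in a tiny model and a small slice of the data to see whether the OOM
# follows the model/quantization setup or the dataset pipeline.
debug_model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
debug_train = train_dataset.select(range(100))  # datasets.Dataset.select keeps only the first 100 rows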