CUDA Out of Memory while fine-tuning even with LoRA

I am facing an out-of-memory error when trying to fine-tune Gemma-2b for Sequence Classification. My code is below:

import torch as t
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)

training_dataset = Dataset.from_pandas(training_df, split="train")
testing_dataset = Dataset.from_pandas(testing_df)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=t.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Setup the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID,
                                          padding_side="right",
                                          add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token

# Setup the model
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID,
                                                           num_labels=7,
                                                           quantization_config=bnb_config,
                                                           device_map={"":0})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512, padding="max_length")

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
     
training_dataset = training_dataset.map(tokenize, batched=True)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=4,
    bias="none",
    task_type="SEQ_CLS",
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj"],
)

model = get_peft_model(model, peft_config)
training_arguments = TrainingArguments(
    output_dir="./model",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=100,
)
trainer = Trainer(
    model=model,
    train_dataset=training_dataset,
    tokenizer=tokenizer,
    args=training_arguments,
)
trainer.train()

This is the error I get:

File ~/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/nn/modules/linear.py:116, in Linear.forward(self, input)
    115 def forward(self, input: Tensor) -> Tensor:
--> 116     return F.linear(input, self.weight, self.bias)

OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU 0 has a total capacity of 39.39 GiB of which 3.60 GiB is free. Process 517440 has 35.78 GiB memory in use. Of the allocated memory 33.44 GiB is allocated by PyTorch, and 1.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

A look at the dataframes for the training and testing data is attached as an image. There are 3 features - text, label_text and label. I am using 7 labels indexed 1-7.

Choose a batch size that fits in memory. Also consider using mixed precision (fp16/bf16), gradient checkpointing, and gradient accumulation.
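
For example, something like this keeps an effective batch size of 128 while only holding a micro-batch of 8 in memory at a time (a rough sketch, the exact numbers are placeholders you would tune for your setup):

training_arguments = TrainingArguments(
    output_dir="./model",
    learning_rate=2e-5,
    per_device_train_batch_size=8,    # micro-batch that actually fits in VRAM
    gradient_accumulation_steps=16,   # 8 * 16 = effective batch size of 128
    gradient_checkpointing=True,      # recompute activations instead of storing them all
    bf16=True,                        # mixed precision (use fp16=True on GPUs without bfloat16 support)
    weight_decay=0.01,
    logging_steps=100,
    save_steps=100,
)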

Thanks for the suggestion, this is my new TrainingArguments setup (I’m using an A100 on Brev). This time it at least started training:

training_arguments = TrainingArguments(
    output_dir="./model",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    warmup_ratio=0.03,
    fp16=False,
    bf16=True,
    optim="paged_adamw_32bit",
    weight_decay=0.01,
    logging_steps=100,
    save_steps=100,
)

However, during training I get this weird error

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

Run with this set at the very top of your script (the value has to be a string). It makes CUDA kernel launches synchronous, so the traceback points at the operation that actually failed rather than at a later, unrelated call.

Ahh, I figured out the error: it was in my label-to-id mapping (it was 1-indexed, not 0-indexed). That is why the assert said t < n_classes.
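
In case anyone hits the same assert, the fix is roughly this (a minimal sketch, the column name matches my dataframe above):

# Shift the 1-indexed labels down to the 0..num_labels-1 range the model expects
training_dataset = training_dataset.map(lambda example: {"label": example["label"] - 1})
testing_dataset = testing_dataset.map(lambda example: {"label": example["label"] - 1})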

Thanks for the help.

Glad you’ve got it sorted. On batch size, use as much of the available VRAM as possible while keeping your batch size a power of two (2, 4, 8, 16, etc.), since this tends to train faster, and some studies suggest batch size can also affect model performance.

