CUDA Out of Memory Error When Training Specific Layers

Hello everyone!

I am trying to train the mistralai/Mistral-Nemo-Instruct-2407 model using the Hugging Face Trainer. My goal is to train layers 33 to 39 (7 layers) of the model. The model, when fully loaded, takes around 45 GiB of VRAM. I am using an NVIDIA H100 GPU with 80 GiB of VRAM, so there should be enough memory since I am only training the last 7 layers.

However, I am encountering a CUDA Out of Memory Error after a couple of steps. Is there anything wrong with my setup?

Here is my complete script:

import torch
from train_data_v2 import get_data
from datasets import Dataset
from transformers import MistralForCausalLM, AutoTokenizer, Trainer, TrainingArguments

def tokenize_function(example, tokenizer):
  max_len = 2400

  # Tokenize input
  model_inputs = tokenizer(
      example["input"],
      padding="max_length",
      truncation=True,
      max_length=max_len
  )
  
  # Tokenize output/labels
  with tokenizer.as_target_tokenizer():
      labels = tokenizer(
          example["output"],
          padding="max_length",
          truncation=True,
          max_length=max_len
      )

  # Replace padding token id with -100 for labels
  labels_tensor = torch.tensor(
      [label if label != tokenizer.pad_token_id else -100 for label in labels["input_ids"]],
      dtype=torch.long
  )

  return {
      "input_ids": torch.tensor(model_inputs["input_ids"], dtype=torch.long),
      "attention_mask": torch.tensor(model_inputs["attention_mask"], dtype=torch.long),
      "labels": labels_tensor
  }

def main():
  # Initialize tokenizer
  tokenizer = AutoTokenizer.from_pretrained(
      "mistralai/Mistral-Nemo-Instruct-2407",
      token="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",
      padding_side="right"
  )
  
  tokenizer.pad_token = tokenizer.eos_token

  # Load model
  model = MistralForCausalLM.from_pretrained(
      "mistralai/Mistral-Nemo-Instruct-2407", 
      token="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",
      torch_dtype=torch.float16
  ).to("cuda")

  # Freeze all parameters first
  for param in model.parameters():
      param.requires_grad = False

  # Unfreeze specific layers
  layers_to_unfreeze = list(range(33, 39))
  for i in layers_to_unfreeze:
      for param in model.model.layers[i].parameters():
          param.requires_grad = True

  # Load and process dataset
  dataset = Dataset.from_list(get_data(tokenizer)) # Here I use a custom function to load my data, the tokenizer is used for adding special tokens at the right place.
  tokenized_dataset = dataset.map(
      lambda x: tokenize_function(x, tokenizer),
      remove_columns=dataset.column_names,
      batched=False,
      num_proc=1
  )

  print("Data Loaded and tokenized!")

  # Training arguments
  training_args = TrainingArguments(
      output_dir="./results",
      evaluation_strategy="no",
      per_device_train_batch_size=1,
      gradient_accumulation_steps=8,
      num_train_epochs=3,
      learning_rate=1e-6,
      weight_decay=0.01,
      warmup_steps=40,
      logging_dir='./logs',
      logging_steps=50,
      fp16=True
  )

  # Initialize trainer
  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=tokenized_dataset,
  )

  # Start training
  print("Starting Training")
  trainer.train()
  print("Training Completed!")

  # Save the model
  trainer.save_model("./final_model")
  tokenizer.save_pretrained("./final_tokenizer")
  print("Model and Tokenizer saved!")

if __name__ == "__main__":
  main()
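
For reference, a quick sanity check that the freezing leaves only the intended layers trainable could look like this (a minimal sketch that reuses the model object from the script above):

# Minimal sketch: count trainable vs. total parameters after the
# freeze/unfreeze loops, to confirm only the intended layers will be trained.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")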

I also tried calling model.gradient_checkpointing_enable() to enable gradient checkpointing (which should reduce memory usage) using this code:

for param in model.parameters():
    param.requires_grad = True  # Temporarily enable gradients

model.gradient_checkpointing_enable()  # Enable checkpointing

# Freeze all parameters first
for param in model.parameters():
    param.requires_grad = False

# Unfreeze specific layers
layers_to_unfreeze = list(range(33, 39))
for i in layers_to_unfreeze:
    for param in model.model.layers[i].parameters():
        param.requires_grad = True

I also tried adding gradient_checkpointing=True to the training_args.
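
For that attempt, the training arguments looked roughly like this (abbreviated sketch; the other arguments stay the same as in the full script above):

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=1e-6,
    fp16=True,
    gradient_checkpointing=True  # extra flag for this attempt
)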

But I am getting this error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.

Thank you!

That’s strange, it seems to have enough VRAM, but is there a memory leak somewhere?
@MaxChamp By the way, is your token leaking? It’s in the code…

Did you try dropping “max_len” to a lower value? Try 512 or 1024 for a test run and see if it works.

If reducing ‘max_len’ doesn’t resolve the issue, then try enabling gradient checkpointing (in the correct sequence):

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Freeze all parameters first
for param in model.parameters():
    param.requires_grad = False

# Unfreeze specific layers
layers_to_unfreeze = list(range(33, 39))
for i in layers_to_unfreeze:
    for param in model.model.layers[i].parameters():
        param.requires_grad = True

Hi, thanks for the replies!
For my task, decreasing max_len is not possible, as we would miss the objective of the fine-tuning. As for the corrected code @Pankaj8922 gave, I will try it and get back to you.
Thanks!

Since you said it “would miss the objective of the fine-tuning”: I am not telling you to fine-tune the model with a max length of 512 or 1024, but a quick test at that length will show you what’s wrong. If training runs at that max length, it means your GPU’s limitations are causing the OOM.

Hi everyone,

The issue was that model.gradient_checkpointing_enable() could not work due to some misconfiguration. But with the following configuration it worked for me:

After layer freezing, call:

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

Also, add use_cache=False when loading the model.
Finally, add gradient_checkpointing_kwargs={"use_reentrant":False} to the training arguments.
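
Put together, the relevant changes look roughly like this (a sketch of only the modified parts, assuming a transformers version that supports gradient_checkpointing_kwargs):

# Load the model with the KV cache disabled (it is incompatible with gradient checkpointing)
model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    torch_dtype=torch.float16,
    use_cache=False
).to("cuda")

# Freeze everything, then unfreeze the target layers (as in the original script)
for param in model.parameters():
    param.requires_grad = False
for i in range(33, 39):
    for param in model.model.layers[i].parameters():
        param.requires_grad = True

# Enable checkpointing only after freezing, and make the checkpointed
# inputs require grads so the backward pass reaches the unfrozen layers
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)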

When training starts, I am barely using 42 GiB of VRAM!

Thanks for the support!
