RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 11.17 GiB total capacity; 10.62 GiB already allocated; 145.81 MiB free; 10.66 GiB reserved in total by PyTorch)

Hi Huggingface team,

I am trying to fine-tune my MLM RoBERTa model on a binary classification dataset. I’m able to tokenize my entire dataset successfully, but during training I keep getting the same CUDA out-of-memory error (quoted at the top of this post). I’m not sure where the memory is being taken up, but I have attached the entire notebook here for reference.

I suspect it has something to do with my train() method.

Does anyone have any thoughts on why the GPU memory is being almost entirely allocated to PyTorch? Any help is appreciated, thanks!

You can try lowering your batch size. “Reserved by PyTorch” means that the memory is used for the data, model, gradients, etc.
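If you want to see where that memory is going, here is a minimal sketch using the standard torch.cuda introspection calls (drop it in before and after a training step):

import torch

# "Allocated" = memory held by live tensors (weights, activations, gradients, optimizer state).
# "Reserved" = memory the PyTorch caching allocator has claimed from the GPU, including cached free blocks.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")

# Detailed breakdown from the caching allocator:
print(torch.cuda.memory_summary())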

  • Try reducing per_device_train_batch_size.
  • If you don’t want to reduce it drastically, try lowering max_seq_length from 128 to a smaller value, if you think your sequences don’t actually need the full 128 tokens.
  • Use a smaller model like ALBERT v2. In your use case, DistilBERT would probably be decent; it needs less memory for the forward pass. (A sketch combining these is below.)
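A minimal sketch of what those suggestions could look like together; the dataset column name “text” and the exact numbers are assumptions, not taken from the notebook above:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments

# Smaller model: DistilBERT has far fewer layers than RoBERTa-base.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Shorter sequences: truncate to 64 tokens instead of 128 if your inputs allow it.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

# Smaller batches: activation memory scales roughly linearly with batch size.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,   # halve this until the OOM disappears
    gradient_accumulation_steps=4,   # keeps the effective batch size at 8 * 4 = 32
)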

Great @prajjwal1, thanks for the detailed answer!

Thanks @prajjwal1 and @valhalla for the great answers! Turns out it was the batch size that did the trick. Appreciate the help!

Hi all,

I came across a very similar issue trying to train the MLM RoBERTa model using the train function. The setup is the following:

from transformers import RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments
from transformers.trainer_utils import EvaluationStrategy

roberta_config = RobertaConfig(
    vocab_size=10000,
    max_position_embeddings=256,
    num_attention_heads=6,
    num_hidden_layers=3,
    type_vocab_size=1,
)

roberta = RobertaForMaskedLM(config=roberta_config)

training_args = TrainingArguments(
    output_dir=output_path,
    logging_dir=logging_path,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=256,
    save_steps=1_000,
    save_total_limit=2,
    prediction_loss_only=False,
    evaluation_strategy=EvaluationStrategy.EPOCH,
    do_train=True,
    do_eval=True,
    evaluate_during_training=True,
    logging_steps=10,
)

trainer = Trainer(
    model=roberta,
    args=training_args,
    data_collator=data_collator,
    train_dataset=smiles_training_dataset,
    eval_dataset=smiles_eval_dataset,
)

trainer.train()

The error I get looks the same:

  File "/lib/python3.8/site-packages/transformers/trainer.py", line 775, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/matthias/anaconda3/envs/chemtran/lib/python3.8/site-packages/transformers/trainer.py", line 1126, in training_step
    loss.backward()
  File "/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 762.00 MiB (GPU 0; 7.93 GiB total capacity; 6.15 GiB already allocated; 340.06 MiB free; 6.94 GiB reserved in total by PyTorch)

It doesn’t appear immediately though, but rather non-deterministically, some way into the training, which points to a memory leak somewhere. Would you have some tips or ideas on how to approach this?
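One thing that may be worth ruling out, given per_gpu_train_batch_size=256 above: peak activation memory grows with the per-step batch size, and gradient accumulation keeps the effective batch size while shrinking each step. A minimal sketch against the same (older) TrainingArguments API as in the snippet above; the 32 × 8 split is only an example:

training_args = TrainingArguments(
    output_dir=output_path,            # reusing output_path and EvaluationStrategy from the snippet above
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=32,       # smaller per-step batch -> lower peak activation memory
    gradient_accumulation_steps=8,     # 32 * 8 = 256, same effective batch size as before
    save_steps=1_000,
    save_total_limit=2,
    evaluation_strategy=EvaluationStrategy.EPOCH,
    logging_steps=10,
)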

@mkirmse did you ever figure out what was causing the error?

Just delete save_steps and it will work… but I don’t know why!

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.28 GiB (GPU 0; 14.75 GiB total capacity; 6.83 GiB already allocated; 2.18 GiB free; 12.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I’m hyperparameter-tuning the bert-multilingual-uncased model for an NER use case on an AWS EC2 g4dn.metal instance, which has 8 GPUs. My training set contains 110k samples. I first tried training on an instance with 4 GPUs and got a CUDA out-of-memory error.

model = AutoModelForTokenClassification.from_pretrained("bert-multilingual-uncased", id2label=idx2label, label2id=label2idx)


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

I tried reducing the batch size, clearing the cache, and setting max_split_size_mb (PyTorch memory management), but none of that fixed the error. So I started training on a bigger instance with 8 GPUs and I am still facing the same error.
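For reference, a minimal sketch of how those mitigations are usually applied; the specific values here are assumptions, not the exact ones from this run:

import os
import torch
from transformers import TrainingArguments

# Allocator hint suggested by the error message; must be set before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Releases cached blocks back to the driver (does not free memory that is actually allocated).
torch.cuda.empty_cache()

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,   # reduced from 16; this is per GPU, so adding GPUs does not lower it
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # keeps the effective per-GPU batch at 16
    fp16=True,                       # mixed precision roughly halves activation memory
    num_train_epochs=3,
    weight_decay=0.01,
)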

Can someone please help me out with this?