RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 11.17 GiB total capacity; 10.62 GiB already allocated; 145.81 MiB free; 10.66 GiB reserved in total by PyTorch)

Hi Huggingface team,

I am trying to fine-tune my MLM RoBERTa model on a binary classification dataset. I’m able to tokenize my entire dataset successfully, but during training I keep getting the same CUDA out-of-memory error (quoted at the top of this post). I’m not sure where the memory is being taken up, but I have attached the entire notebook here for reference.

I suspect it has something to do with my train() method.

Does anyone have any thoughts on why the GPU memory is being almost entirely allocated to PyTorch? Any help is appreciated, thanks!

You can try lowering your batch size. “Reserved by PyTorch” means that the memory is used for the data, model, gradients, etc.
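If you want to see where that memory is going, here is a minimal sketch using the standard torch.cuda introspection calls (drop it in before and after a training step):

import torch

# "Allocated" = memory held by live tensors (weights, activations, gradients, optimizer state).
# "Reserved" = memory the PyTorch caching allocator has claimed from the GPU, including cached free blocks.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")

# Detailed breakdown from the caching allocator:
print(torch.cuda.memory_summary())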

  • Try reducing per_device_train_batch_size.
  • If you don’t want to reduce it drastically, try lowering max_seq_length from 128 to a smaller value, if you think your sequences don’t actually need the full 128 tokens.
  • Use a smaller model like ALBERT v2. In your use case, DistilBERT would probably be decent; it needs less memory for the forward pass. (A sketch combining these is below.)
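A minimal sketch of what those suggestions could look like together; the dataset column name “text” and the exact numbers are assumptions, not taken from the notebook above:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments

# Smaller model: DistilBERT has far fewer layers than RoBERTa-base.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Shorter sequences: truncate to 64 tokens instead of 128 if your inputs allow it.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

# Smaller batches: activation memory scales roughly linearly with batch size.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,   # halve this until the OOM disappears
    gradient_accumulation_steps=4,   # keeps the effective batch size at 8 * 4 = 32
)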

Great @prajjwal1, thanks for the detailed answer!

Thanks @prajjwal1 and @valhalla for the great answers! Turns out it was the batch size that did the trick. Appreciate the help!

Hi all,

I came across a very similar issue trying to train the MLM RoBERTa model using the train function. The setup is the following:

from transformers import RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments
from transformers.trainer_utils import EvaluationStrategy

roberta_config = RobertaConfig(
    vocab_size=10000,
    max_position_embeddings=256,
    num_attention_heads=6,
    num_hidden_layers=3,
    type_vocab_size=1,
)

roberta = RobertaForMaskedLM(config=roberta_config)

training_args = TrainingArguments(
    output_dir=output_path,
    logging_dir=logging_path,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=256,
    save_steps=1_000,
    save_total_limit=2,
    prediction_loss_only=False,
    evaluation_strategy=EvaluationStrategy.EPOCH,
    do_train=True,
    do_eval=True,
    evaluate_during_training=True,
    logging_steps=10,
)

trainer = Trainer(
    model=roberta,
    args=training_args,
    data_collator=data_collator,
    train_dataset=smiles_training_dataset,
    eval_dataset=smiles_eval_dataset,
)

trainer.train()

The error I get looks the same:

  File "/lib/python3.8/site-packages/transformers/trainer.py", line 775, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/matthias/anaconda3/envs/chemtran/lib/python3.8/site-packages/transformers/trainer.py", line 1126, in training_step
    loss.backward()
  File "/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 762.00 MiB (GPU 0; 7.93 GiB total capacity; 6.15 GiB already allocated; 340.06 MiB free; 6.94 GiB reserved in total by PyTorch)

It doesn’t appear immediately though, but rather non-deterministically, some way into the training, which points to a memory leak somewhere. Would you have some tips or ideas on how to approach this?
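One thing that may be worth ruling out, given per_gpu_train_batch_size=256 above: peak activation memory grows with the per-step batch size, and gradient accumulation keeps the effective batch size while shrinking each step. A minimal sketch against the same (older) TrainingArguments API as in the snippet above; the 32 × 8 split is only an example:

training_args = TrainingArguments(
    output_dir=output_path,            # reusing output_path and EvaluationStrategy from the snippet above
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=32,       # smaller per-step batch -> lower peak activation memory
    gradient_accumulation_steps=8,     # 32 * 8 = 256, same effective batch size as before
    save_steps=1_000,
    save_total_limit=2,
    evaluation_strategy=EvaluationStrategy.EPOCH,
    logging_steps=10,
)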

@mkirmse did you ever figure out what was causing the error?

Just delete save_steps and it will work… but I don’t know why!

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.28 GiB (GPU 0; 14.75 GiB total capacity; 6.83 GiB already allocated; 2.18 GiB free; 12.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I’m hyperparameter-tuning the bert-multilingual-uncased model for an NER use case on an AWS EC2 g4dn.metal instance, which has 8 GPUs. My training set contains 110k samples. I first tried training on an instance with 4 GPUs and got a CUDA out-of-memory error.

model = AutoModelForTokenClassification.from_pretrained("bert-multilingual-uncased", id2label=idx2label, label2id=label2idx)


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

I tried reducing the batch size, clearing the cache, and setting max_split_size_mb (PyTorch memory management), but none of that fixed the error. So I started training on a bigger instance with 8 GPUs and I am still facing the same error.
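For reference, a minimal sketch of how those mitigations are usually applied; the specific values here are assumptions, not the exact ones from this run:

import os
import torch
from transformers import TrainingArguments

# Allocator hint suggested by the error message; must be set before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Releases cached blocks back to the driver (does not free memory that is actually allocated).
torch.cuda.empty_cache()

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,   # reduced from 16; this is per GPU, so adding GPUs does not lower it
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # keeps the effective per-GPU batch at 16
    fp16=True,                       # mixed precision roughly halves activation memory
    num_train_epochs=3,
    weight_decay=0.01,
)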

Can someone please help me out with this?