Hi Huggingface team,

I am trying to fine-tune my MLM RoBERTa model on a binary classification dataset. I’m able to tokenize my entire dataset successfully, but during training I keep getting the same CUDA memory error. I’m not sure where the memory is being taken up, but I have attached the entire notebook here for reference.

Error message:

RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 11.17 GiB total capacity; 10.62 GiB already allocated; 145.81 MiB free; 10.66 GiB reserved in total by PyTorch)

I suspect it has something to do with my train() method.

Does anyone have any thoughts on why the GPU memory is being almost entirely allocated to PyTorch? Any help is appreciated, thanks!

You can try lowering your batch size. “Reserved in total by PyTorch” means that memory is used for the data, the model, the gradients, etc.
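
If you want to see how much of that reserved memory is actually occupied by tensors at any point, you can query PyTorch directly (a minimal sketch):

import torch

# memory currently occupied by live tensors, in MiB
allocated = torch.cuda.memory_allocated(0) / 1024 ** 2
# memory held by PyTorch's caching allocator (the "reserved" number in the error), in MiB
reserved = torch.cuda.memory_reserved(0) / 1024 ** 2
print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")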

  • Try reducing per_device_train_batch_size (see the sketch after this list).
  • If you don’t want to reduce it drastically, try lowering max_seq_length from 128 if you think your sequences don’t actually need the full 128-token window.
  • Use a smaller model such as ALBERT v2; for your use case, DistilBERT might be decent. A smaller model needs less GPU memory for the forward pass.
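
To make the first two suggestions concrete, here is a rough sketch of what that could look like with DistilBERT; the checkpoint name, the "text" column, and the exact numbers are only examples, and gradient accumulation is an extra option (not listed above) to keep the effective batch size up while lowering the per-device one:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # cap sequences well below 128 if your texts are short
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,   # smaller batch -> less activation memory
    gradient_accumulation_steps=4,   # keeps the effective batch size at 32
)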

great @prajjwal1, thanks for the detailed answer


Thanks @prajjwal1 and @valhalla for the great answers! Turns out it was the batch size that did the trick! Appreciate the help!!

Hi all,

I came across a very similar issue trying to train the MLM RoBERTa model using the train function. The setup is the following:

from transformers import RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments
from transformers.trainer_utils import EvaluationStrategy

roberta_config = RobertaConfig(
    vocab_size=10000,
    max_position_embeddings=256,
    num_attention_heads=6,
    num_hidden_layers=3,
    type_vocab_size=1,
)

roberta = RobertaForMaskedLM(config=roberta_config)

training_args = TrainingArguments(
    output_dir=output_path,
    logging_dir=logging_path,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=256,
    save_steps=1_000,
    save_total_limit=2,
    prediction_loss_only=False,
    evaluation_strategy=EvaluationStrategy.EPOCH,
    do_train=True,
    do_eval=True,
    evaluate_during_training=True,
    logging_steps=10,
)

trainer = Trainer(
    model=roberta,
    args=training_args,
    data_collator=data_collator,
    train_dataset=smiles_training_dataset,
    eval_dataset=smiles_eval_dataset,
)

trainer.train()

The error I get looks the same:

  File "/lib/python3.8/site-packages/transformers/trainer.py", line 775, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/matthias/anaconda3/envs/chemtran/lib/python3.8/site-packages/transformers/trainer.py", line 1126, in training_step
    loss.backward()
  File "/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 762.00 MiB (GPU 0; 7.93 GiB total capacity; 6.15 GiB already allocated; 340.06 MiB free; 6.94 GiB reserved in total by PyTorch)

It doesn’t appear immediately though, but rather non-deterministically, far into the training, which points to a memory leak somewhere. Would you have some tips or ideas on how to approach this?
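
One thing that might help narrow it down is logging GPU memory every few steps with a TrainerCallback, to see whether usage creeps up steadily (a leak) or spikes on a single batch. A minimal sketch, assuming a transformers version that provides TrainerCallback (the callback name is made up):

import torch
from transformers import TrainerCallback

class GpuMemoryLogger(TrainerCallback):
    # prints allocated/reserved GPU memory at every logging step
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % args.logging_steps == 0:
            allocated = torch.cuda.memory_allocated() / 1024 ** 2
            reserved = torch.cuda.memory_reserved() / 1024 ** 2
            print(f"step {state.global_step}: {allocated:.0f} MiB allocated, {reserved:.0f} MiB reserved")

# passed to the Trainer via callbacks=[GpuMemoryLogger()]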

@mkirmse did you ever figure out what was causing the error?