Always getting RuntimeError: CUDA out of memory with Trainer

Hello,

I am using Hugging Face Transformers on my Google Colab Pro+ instance, and I keep getting errors like

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.78 GiB total capacity; 13.92 GiB already allocated; 206.75 MiB free; 13.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I don't understand why. My dataset is microscopic (40K sentences), and all I am doing is loading bert-large-cased and following along with the text classification notebook:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset, load_metric

tokenizer = AutoTokenizer.from_pretrained('bert-large-cased')
metric = load_metric('glue', 'sst2')
model = AutoModelForSequenceClassification.from_pretrained('bert-large-cased', num_labels=2)
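
(For reference, the tokenized_datasets passed to the Trainer below comes from mapping the tokenizer over the dataset. A minimal sketch, assuming the SST-2 'sentence' column name from the notebook; capping max_length also keeps activation memory down:)

raw_datasets = load_dataset('glue', 'sst2')

def tokenize_fn(examples):
    # truncate so no example exceeds 128 tokens
    return tokenizer(examples['sentence'], truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)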

My Trainer args are pretty standard:


from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

batch_size = 16

args = TrainingArguments(
    output_dir='/content/drive/MyDrive/kaggle/',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to="none",
    metric_for_best_model='accuracy',
)

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # take the argmax over the logits to get class predictions
    predictions = np.argmax(predictions, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Am I missing something? Should I change some of the options?
Thanks!!


(Just posting this in case no one smarter comes along with a better idea.)

Colab's performance varies a lot. I ran the same script (the dataset in question had 1,200 sentences) and sometimes I get an out-of-memory error and sometimes I don't. My latest project has 270 sentences and ran fine on the first try.

Thanks, but Colab Pro+ gives you about 50 GB of RAM and a Tesla P100… so I should have enough RAM here…

Make sure you're not running out of GPU RAM, though. The GPU is capped at about 16 GB, and your error message shows PyTorch has already reserved close to 14 GB of it.

What do you mean by this? The amount of GPU memory used generally depends on the model, batch size, and sequence length.
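
If it helps, you can check from inside the notebook what the GPU is actually holding (plain PyTorch calls, nothing model-specific):

import torch

print(torch.cuda.get_device_name(0))
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")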

This is where things go over my head. If anyone smarter can interject I’d appreciate it :sweat_smile:

But if you look at the error message the OP posted, it appears that their GPU memory is being hogged by PyTorch?

@BramVanroy thanks for your input. Is there a rough back-of-the-envelope calculation to know how much memory I need to train a model? It seems bert-large-cased is quite big, but how big?

Thanks

The question belongs in an FAQ.

You can use this model memory usage calculator for a general idea: Model Memory Utility - a Hugging Face Space by hf-accelerate
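
As a rough sketch of the arithmetic (the numbers are approximate, and activations come on top of this, scaling with batch size and sequence length):

params = 335e6                # bert-large has roughly 335M parameters
weights = params * 4          # fp32 weights, 4 bytes each
grads = weights               # one gradient per weight
adam_states = 2 * weights     # Adam keeps two moment estimates per weight
total = weights + grads + adam_states
print(f"~{total / 1024**3:.1f} GiB before activations")   # about 5 GiB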

If it's failing right at the beginning of calling .train(), then I don't think the optimizer states are the culprit.
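
In that case, the usual levers are a smaller per-device batch size, gradient accumulation, and mixed precision. A sketch against the OP's arguments (all of these are standard TrainingArguments options; the values are just a starting point):

args = TrainingArguments(
    output_dir='/content/drive/MyDrive/kaggle/',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,     # smaller per-step batch
    gradient_accumulation_steps=4,     # keeps the effective batch size at 16
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to="none",
    fp16=True,                         # mixed precision roughly halves activation memory
    metric_for_best_model='accuracy',
)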

If you are getting the OOM error partway through training, you are most likely storing tensors on the GPU (e.g. accumulating losses or metrics without detaching them). Move those values to the CPU and the OOM should go away.
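
A hypothetical example of the pattern in a custom loop (model, dataloader, and optimizer assumed to come from your own code): appending the raw loss tensor keeps the whole autograd graph for every step alive on the GPU; detach and move it to the CPU (or call .item()) instead.

running_losses = []
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    running_losses.append(loss.detach().cpu())   # not: running_losses.append(loss)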