I encounter the error below when I fine-tune mBART on my dataset:
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.57 GiB already allocated; 16.25 MiB free; 9.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CON
My training data contains only 5,000 sentences. Could anyone help me sort this out?
As @kkumari06 says, reduce the batch size. I recommend restarting the kernel any time you get this error, to make sure you start with clean GPU memory; then cut the batch size in half. Repeat until it fits in GPU memory or until you hit a batch size of 1… in which case you’ll need to switch to a smaller pretrained model. (If training a model from scratch, you can instead reduce the size of your model, for example by reducing the maximum input size or the number of layers.) Finally, you may want to bump up gradient accumulation if your batch size is very small. For example, if you have a batch size of 4, gradient accumulation of 8 gives you an “effective” batch size of 32, which some research suggests is ideal… however, YMMV. A minimal sketch of this is below.
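For reference, with the Hugging Face `Trainer` the batch-size / gradient-accumulation trade-off is just two arguments. A minimal sketch, assuming you’re using `TrainingArguments` (the output path and epoch count are placeholders):

```python
from transformers import TrainingArguments

# Sketch of the trade-off described above: a small per-device batch that fits
# in GPU memory, with gradient accumulation to recover a larger effective batch.
training_args = TrainingArguments(
    output_dir="./mbart-finetuned",     # hypothetical output path
    per_device_train_batch_size=4,      # small batch that actually fits on the GPU
    gradient_accumulation_steps=8,      # 4 * 8 = effective batch size of 32
    num_train_epochs=3,                 # placeholder
)
```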
I’m hyperparameter-tuning the bert-multilingual-uncased model for an NER use case. I’m using an AWS EC2 g4dn.metal instance, which has 8 GPUs. My training set contains 110k samples. I first tried training on an instance with 4 GPUs and got a CUDA out of memory error.
I tried reducing the batch size, clearing the cache, and setting max_split_size_mb (PyTorch memory management), but none of that fixed the error. So I started training on a bigger instance with 8 GPUs and am still facing the same error.
@Raisa06 The number of GPUs doesn’t matter unless you explicitly split the model across them. Instead of 1/8th of the model on each GPU, you likely have 8 full copies of the model. Aim for GPUs with more memory per card. Also, if you have already decreased the batch size and gradient accumulation steps as much as possible, try distilbert-base-multilingual-cased, which is a smaller version of the model you’re training. Finally, with HF’s accelerate library you can enable DeepSpeed, which is essentially a free memory reduction, and you can also use it to train with mixed precision to cut memory usage further. A sketch of the model swap plus mixed precision follows.
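A minimal sketch of the two cheapest changes mentioned above, swapping in the distilled checkpoint and turning on fp16 mixed precision via `TrainingArguments` (the `num_labels`, batch settings, and output path are placeholders for your NER setup; DeepSpeed would additionally be wired up through `accelerate config` or the `deepspeed` argument):

```python
from transformers import AutoModelForTokenClassification, TrainingArguments

# Smaller distilled multilingual checkpoint instead of the full BERT model.
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-multilingual-cased",
    num_labels=9,                     # placeholder: number of NER tags in your scheme
)

training_args = TrainingArguments(
    output_dir="./ner-distilbert",    # hypothetical output path
    per_device_train_batch_size=8,    # placeholder
    gradient_accumulation_steps=4,    # placeholder
    fp16=True,                        # mixed precision roughly halves activation memory
)
```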
Hi @Raisa06
I’m getting the same issue. Were you able to resolve it? Also, how do you clear the cache for the instance? I’m using SageMaker for fine-tuning.
Remember to delete unneeded tensors and variables. Move any tensors you want to keep to the CPU, then call torch.cuda.empty_cache() to free up GPU memory once you’re done with each batch.
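A rough sketch of that pattern inside a training loop, assuming `model`, `optimizer`, and `dataloader` already exist (names are placeholders):

```python
import torch

for batch in dataloader:
    batch = {k: v.to("cuda") for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    stored_loss = loss.detach().cpu()   # keep only what you need, on the CPU
    del outputs, loss, batch            # drop references to GPU tensors
    torch.cuda.empty_cache()            # return cached blocks to the allocator
```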