Regarding CUDA OOM!

Rajd99 · February 14, 2023, 1:47am

@shamanez I am using 4 GPUs with 24 GB of memory each on AWS SageMaker. Even with a batch size of 2 (kept very low), training fails with a CUDA out-of-memory error after a number of iterations. Can you suggest how I can resolve this?
Also, after every training epoch it runs over the entire validation set, so the number of iterations multiplies. Any hints on solving this as well?
