CUDA out-of-memory error on unchanged workshop 1 notebooks

I am running notebooks 1 and 3 unchanged from https://github.com/philschmid/huggingface-sagemaker-workshop-series/tree/main/workshop_1_getting_started_with_amazon_sagemaker

When I do, I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.78 GiB total capacity; 14.80 GiB already allocated; 44.75 MiB free; 14.83 GiB reserved in total by PyTorch)

I have tried different batch sizes and learning rates, but can someone help me understand why not everyone gets this error if we are all using the same AWS resources?
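
For reference, this is roughly how I am passing the smaller batch size to the training job (a minimal sketch based on the workshop notebook; the hyperparameter names, framework versions, and variables such as `role` and `training_input_path` are assumptions taken from that notebook and may differ from your setup):

```python
from sagemaker.huggingface import HuggingFace

# Hyperparameters forwarded to the workshop's train.py script.
# Lowering train_batch_size is the first knob to try for a CUDA OOM
# on the single 16 GB V100 in an ml.p3.2xlarge.
hyperparameters = {
    "epochs": 1,
    "train_batch_size": 8,  # reduced from the notebook's default
    "model_name": "distilbert-base-uncased",
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,  # assumes the execution role defined earlier in the notebook
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters=hyperparameters,
)

# training_input_path / test_input_path are the S3 URIs uploaded earlier
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```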

Hello @kjackson,

I have now run the notebook twice, as-is from “main”, and never got any error.

2021-12-01 08:46:23 Uploading - Uploading generated training model
2021-12-01 08:48:23 Completed - Training job completed
ProfilerReport-1638347816: NoIssuesFound
Training seconds: 490
Billable seconds: 490