Distributed Training on SageMaker

I can confirm that the t5/bart_summarization notebook runs as-is on an ml.p3dn.24xlarge. I remember that someone tried to run it on the smaller ml.p3.16xlarge and needed to decrease the batch_size as well (a sketch of how that could look is below).
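As a minimal sketch of what lowering the batch size could look like, assuming the notebook launches training through the SageMaker HuggingFace estimator and that the training script accepts the usual 🤗 Trainer arguments (the script name, model, versions, and hyperparameter values here are examples, not the notebook's actual settings):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

# Assumes this runs inside a SageMaker notebook with an attached execution role.
role = sagemaker.get_execution_role()

# Hyperparameters forwarded to the training script; the names follow the
# 🤗 Trainer argument style used in the summarization examples.
hyperparameters = {
    "model_name_or_path": "facebook/bart-large-cnn",
    "per_device_train_batch_size": 2,   # lowered to fit on 16 GB V100s
    "gradient_accumulation_steps": 4,   # keep the effective batch size roughly the same
    "fp16": True,
}

huggingface_estimator = HuggingFace(
    entry_point="run_summarization.py",  # hypothetical script name
    source_dir="./scripts",
    instance_type="ml.p3dn.24xlarge",    # 8x V100 with 32 GB each
    instance_count=1,
    role=role,
    transformers_version="4.6",          # example versions, adjust to your setup
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
    # SageMaker data parallel distribution config
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

huggingface_estimator.fit()
```

Pairing a smaller per_device_train_batch_size with gradient_accumulation_steps is the usual way to cut GPU memory without changing the effective batch size.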

The error you attached still shows a CUDA out-of-memory error:

[1,2]<stdout>:RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 2; 15.78 GiB total capacity; 13.72 GiB already allocated; 249.75 MiB free; 14.04 GiB reserved in total by PyTorch)

Could you downsample your dataset a bit and try again, or run a few steps on an ml.p3dn.24xlarge?
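For the downsampling, a minimal sketch using the 🤗 datasets library could look like this (the dataset name and sample count are just placeholders, pick whatever your notebook actually loads):

```python
from datasets import load_dataset

# Take a small, shuffled slice of the training split before tokenization,
# just to confirm the job runs end to end on the smaller instance.
raw = load_dataset("cnn_dailymail", "3.0.0")
small_train = raw["train"].shuffle(seed=42).select(range(5000))
```

If the OOM disappears with the smaller slice and batch size, that points to the memory footprint rather than the distributed setup itself.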