CUDA OOM error when using data-distributed mode on AWS p4d.24xlarge instance

In data-distributed mode each GPU holds a full copy of the model, so the batch activations and the optimiser state also need to fit into that GPU's VRAM alongside the weights.

Try mixed-precision (FP16) training, gradient checkpointing, and/or gradient accumulation with a smaller per-GPU batch size, as in the sketch below.
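
A minimal PyTorch sketch combining the three, assuming a hypothetical model split into `encoder`/`head` and a `train_loader` defined elsewhere; under DDP this is the per-rank training step and stays the same:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = MyModel().cuda()                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()            # FP16 loss scaling
accumulation_steps = 4                          # effective batch = micro-batch * 4

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(train_loader):   # placeholder loader
    inputs, targets = inputs.cuda(), targets.cuda()

    with torch.cuda.amp.autocast():             # run the forward pass in FP16
        # Gradient checkpointing: recompute the encoder's activations during
        # the backward pass instead of keeping them in memory.
        hidden = checkpoint(model.encoder, inputs, use_reentrant=False)
        loss = nn.functional.cross_entropy(model.head(hidden), targets)
        loss = loss / accumulation_steps        # average over accumulated steps

    scaler.scale(loss).backward()               # accumulate scaled gradients

    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                  # unscale grads and update weights
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

The trade-off: FP16 roughly halves activation memory, checkpointing trades memory for extra compute in the backward pass, and accumulation lets you keep a large effective batch while shrinking the per-step footprint.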