Memory Error While Fine-tuning AYA on 8 H100 GPUs


I am currently trying to fine-tune an AYA model on 8 H100 GPUs, but I keep hitting an out-of-memory error. My system has 640 GB of total GPU memory (8 × 80 GB), which I assumed would be sufficient. I am doing full fine-tuning (no PEFT or LoRA), and my batch size is set to 1.
Has anyone run into a similar issue who could offer some guidance? How many GPUs (or how much total memory) is typically recommended for full fine-tuning of this model? Any help would be greatly appreciated.
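For context, here is the back-of-envelope arithmetic I used to conclude 640 GB should be enough. This is only a sketch: it assumes standard mixed-precision AdamW (bf16 weights and gradients, fp32 master weights and optimizer states, roughly 16 bytes per parameter) and it ignores activations, which can add a lot on top. The 13B and 35B parameter counts are my assumptions for the smaller and larger AYA checkpoints.

```python
# Back-of-envelope memory estimate for full fine-tuning with mixed-precision
# AdamW. Model states only -- activations, buffers, and fragmentation are NOT
# included, so real usage will be higher. Parameter counts are assumptions.

def full_finetune_gib(n_params: float) -> float:
    """Estimate GPU memory (GiB) needed for model + optimizer states."""
    bytes_per_param = (
        2    # bf16 weights
        + 2  # bf16 gradients
        + 4  # fp32 master copy of weights
        + 8  # AdamW momentum + variance (fp32)
    )
    return n_params * bytes_per_param / 1024**3

for name, n in [("~13B model", 13e9), ("~35B model", 35e9)]:
    print(f"{name}: ~{full_finetune_gib(n):.0f} GiB before activations")
```

By this estimate a ~13B model needs roughly 194 GiB and a ~35B model roughly 522 GiB for model states alone, so 640 GB looked like it should fit, at least for the smaller checkpoint, assuming the states are actually sharded across the 8 GPUs (e.g. via ZeRO-3 or FSDP) rather than replicated on each one.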

Thanks in advance!
