Memory Error While Fine-tuning AYA on 8 H100 GPUs


I am currently trying to fine-tune an AYA model on 8 H100 GPUs, but I keep hitting an out-of-memory error. My system has 640 GB of total GPU memory (8 × 80 GB), which I assumed would be sufficient. I am doing full fine-tuning (no PEFT or LoRA), and my batch size is set to 1.
Has anyone run into a similar issue who could offer some guidance? How many GPUs (or how much total memory) is typically recommended for full fine-tuning of this model? Any help would be greatly appreciated.
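For context, here is the back-of-envelope arithmetic I used to conclude 640 GB should be enough. This is only a sketch: it assumes standard mixed-precision AdamW (bf16 weights and gradients, fp32 master weights and optimizer states, roughly 16 bytes per parameter) and it ignores activations, which can add a lot on top. The 13B and 35B parameter counts are my assumptions for the smaller and larger AYA checkpoints.

```python
# Back-of-envelope memory estimate for full fine-tuning with mixed-precision
# AdamW. Model states only -- activations, buffers, and fragmentation are NOT
# included, so real usage will be higher. Parameter counts are assumptions.

def full_finetune_gib(n_params: float) -> float:
    """Estimate GPU memory (GiB) needed for model + optimizer states."""
    bytes_per_param = (
        2    # bf16 weights
        + 2  # bf16 gradients
        + 4  # fp32 master copy of weights
        + 8  # AdamW momentum + variance (fp32)
    )
    return n_params * bytes_per_param / 1024**3

for name, n in [("~13B model", 13e9), ("~35B model", 35e9)]:
    print(f"{name}: ~{full_finetune_gib(n):.0f} GiB before activations")
```

By this estimate a ~13B model needs roughly 194 GiB and a ~35B model roughly 522 GiB for model states alone, so 640 GB looked like it should fit, at least for the smaller checkpoint, assuming the states are actually sharded across the 8 GPUs (e.g. via ZeRO-3 or FSDP) rather than replicated on each one.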

Thanks in advance!
