I have two AMD GPUs with ROCm. I want to use the SFTTrainer class with the accelerate library to fine-tune an LLM on the two GPUs with distributed data parallelism (DDP). However, I keep running into out-of-memory (OOM) errors, even though fine-tuning runs fine on a single GPU. Does the accelerate library support ROCm, or is the issue something else?
Hi @imamcsiro, could you try with a smaller model and check how much memory usage increases? It should work, since PyTorch supports AMD ROCm GPUs.
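One thing worth keeping in mind while you test: under DDP, each GPU holds a full replica of the model, its gradients, and its optimizer states, so going from one GPU to two does not reduce per-GPU memory, and the DDP wrapper itself adds communication buffers. A rough back-of-envelope sketch (assuming fp16 weights and gradients with AdamW keeping fp32 master weights and moments, a common mixed-precision setup; activations and buffers are excluded, so real usage will be higher):

```python
def estimate_ddp_memory_gb(n_params: float) -> float:
    """Rough per-GPU memory for DDP fine-tuning, excluding activations.

    Assumes fp16 weights and gradients plus an AdamW optimizer that
    keeps fp32 master weights and two fp32 moment buffers. These are
    illustrative assumptions, not a measurement of any specific setup.
    """
    bytes_per_param = (
        2    # fp16 weights
        + 2  # fp16 gradients
        + 4  # fp32 master copy of the weights
        + 8  # fp32 Adam moments (exp_avg + exp_avg_sq)
    )
    return n_params * bytes_per_param / 1024**3

# A 7B-parameter model needs roughly this much per GPU, before activations:
print(f"{estimate_ddp_memory_gb(7e9):.0f} GB")  # → 104 GB
```

If the number this gives for your model is already close to a single card's VRAM, full fine-tuning under DDP will OOM regardless of backend, and techniques like LoRA, gradient checkpointing, or a smaller batch size are the usual workarounds.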