I have two AMD GPUs with ROCm. I want to use the SFTTrainer class with the accelerate library to fine-tune an LLM on the two GPUs with distributed data parallelism (DDP). However, I keep running into out-of-memory (OOM) errors, even though fine-tuning runs fine on a single GPU. Does the accelerate library support ROCm, or is the issue something else?
Hi @imamcsiro, could you try with a smaller model and check how much memory usage increases? It should work, since PyTorch supports AMD ROCm GPUs.
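One thing worth keeping in mind while you test: under DDP, each GPU holds a full replica of the model, its gradients, and its optimizer states, so going from one GPU to two does not reduce per-GPU memory, and the DDP wrapper itself adds communication buffers. A rough back-of-envelope sketch (assuming fp16 weights and gradients with AdamW keeping fp32 master weights and moments, a common mixed-precision setup; activations and buffers are excluded, so real usage will be higher):

```python
def estimate_ddp_memory_gb(n_params: float) -> float:
    """Rough per-GPU memory for DDP fine-tuning, excluding activations.

    Assumes fp16 weights and gradients plus an AdamW optimizer that
    keeps fp32 master weights and two fp32 moment buffers. These are
    illustrative assumptions, not a measurement of any specific setup.
    """
    bytes_per_param = (
        2    # fp16 weights
        + 2  # fp16 gradients
        + 4  # fp32 master copy of the weights
        + 8  # fp32 Adam moments (exp_avg + exp_avg_sq)
    )
    return n_params * bytes_per_param / 1024**3

# A 7B-parameter model needs roughly this much per GPU, before activations:
print(f"{estimate_ddp_memory_gb(7e9):.0f} GB")  # → 104 GB
```

If the number this gives for your model is already close to a single card's VRAM, full fine-tuning under DDP will OOM regardless of backend, and techniques like LoRA, gradient checkpointing, or a smaller batch size are the usual workarounds.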