Trouble using torchmetrics with Accelerate in the Stable Diffusion fine-tuning script


I’m trying to add an FID metric to the log_validation function of the Stable Diffusion training script under diffusers/examples/text_to_image. I’m following the Evaluating Diffusion Models guide and using the torchmetrics FID implementation. However, even with ~8 GB of GPU memory free, the call to fid.compute() hangs indefinitely without raising an error. All validation code runs only on the main process (guarded by accelerator.is_main_process), and I’ve tried manually moving the inputs and the fid object to the appropriate CUDA device.

The validation code works fine if I run it standalone, so I suspect it’s something to do with Accelerate, even though I’m manually placing the tensors and fid model. Any ideas?

Update: when I force-quit the training run, I see: “[Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=224, OpType=_ALLGATHER_BASE, NumelIn=2, NumelOut=8, Timeout(ms)=1800000) ran for 1800596 milliseconds before timing out.”

So something is causing that NCCL call to hang…
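
One plausible explanation, not confirmed anywhere in this thread: torchmetrics metrics synchronize their state across processes inside compute() whenever torch.distributed is initialized, which it is under accelerate launch. If only the main process reaches fid.compute(), the other ranks never join that all-gather, and the collective waits until the NCCL timeout, which would be consistent with the _ALLGATHER_BASE entry in the log above. The sketch below assumes a torchmetrics release whose Metric base class accepts sync_on_compute; build_local_fid and compute_fid_on_main are hypothetical helper names, not part of either library.

```python
def build_local_fid(feature=2048, metric_cls=None):
    """Create an FID metric whose compute() does not all-gather across ranks.

    Assumption: the installed torchmetrics version supports the
    `sync_on_compute` keyword on metrics; check your release notes.
    """
    if metric_cls is None:
        # Imported lazily so this helper can be defined without torchmetrics.
        from torchmetrics.image.fid import FrechetInceptionDistance as metric_cls
    return metric_cls(feature=feature, sync_on_compute=False)


def compute_fid_on_main(accelerator, real_images, fake_images):
    """Hypothetical helper: call it on every rank; only the main rank computes.

    `real_images` and `fake_images` are assumed to be uint8 tensors of shape
    (N, 3, H, W), the default input format of FrechetInceptionDistance.
    """
    score = None
    if accelerator.is_main_process:  # a property, not a method
        fid = build_local_fid().to(accelerator.device)
        fid.update(real_images.to(accelerator.device), real=True)
        fid.update(fake_images.to(accelerator.device), real=False)
        score = float(fid.compute())  # local only, no cross-rank sync
    # Hold the other ranks here so they do not run ahead into the next
    # training step's collectives while the main rank is still busy.
    accelerator.wait_for_everyone()
    return score
```

Note that a long FID computation on the main rank can still trip the NCCL timeout for the ranks waiting at the barrier, so this only helps when validation finishes well inside that window.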

Hi, did you manage to solve this? I ran into the same issue. Thanks!

No luck. As a workaround, I ended up moving the FID validation code to a standalone process. It’s less convenient, but the decoupling has the benefit of reducing GPU memory and compute load during the Accelerate-based training (assuming you have another GPU to run the FID calculation on).
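
For anyone wanting to try the same workaround, here is a rough sketch of what such a standalone script could look like. It assumes the training run saves its validation images to a folder and that a folder of reference images exists; the function names, folder layout, and the 299-pixel resize are illustrative choices of mine, not what the original script does. Because this process is started with plain python rather than accelerate launch, torch.distributed is never initialized and compute() has nothing to synchronize.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(folder):
    """Return the image files in `folder`, sorted for reproducibility."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.suffix.lower() in IMAGE_SUFFIXES)


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def load_batch(paths, size=299):
    """Load images as one uint8 tensor of shape (N, 3, size, size)."""
    # Heavy dependencies are imported lazily, only when images are loaded.
    import numpy as np
    import torch
    from PIL import Image

    arrays = [np.asarray(Image.open(p).convert("RGB").resize((size, size)))
              for p in paths]
    # Stack to (N, H, W, 3), then move channels first as the metric expects.
    return torch.from_numpy(np.stack(arrays)).permute(0, 3, 1, 2).contiguous()


def fid_between(real_dir, fake_dir, device="cuda", batch_size=32):
    """Compute FID between two image folders in a single, non-distributed process."""
    from torchmetrics.image.fid import FrechetInceptionDistance

    fid = FrechetInceptionDistance(feature=2048).to(device)
    for is_real, folder in ((True, real_dir), (False, fake_dir)):
        for paths in batched(list_images(folder), batch_size):
            fid.update(load_batch(paths).to(device), real=is_real)
    return float(fid.compute())
```

You could point it at a spare GPU with something like fid_between("reference_images", "validation_images", device="cuda:1"), where both folder names are placeholders.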