Trouble using Torchmetrics with Accelerate in stable diffusion finetuning script

CactusFacts · November 21, 2023, 11:11pm

Hello,

I’m trying to add an FID metric to the log_validation function of the Stable Diffusion training script under diffusers/examples/text_to_image. I’m following the Evaluating Diffusion Models guide and using the torchmetrics FID implementation. However, even with ~8GB of GPU memory free, the call to fid.compute() hangs and never completes, with no error message. All validation code runs only on the main process (guarded by accelerator.is_main_process), and I’ve tried manually moving the inputs and the fid object to the appropriate CUDA device.

The validation code works fine if I run it standalone, so I suspect it’s something to do with Accelerate, even though I’m manually placing the tensors and fid model. Any ideas?

Update: when I force close the training, I see: “[Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=224, OpType=_ALLGATHER_BASE, NumelIn=2, NumelOut=8, Timeout(ms)=1800000) ran for 1800596 milliseconds before timing out.”

So something is causing that NCCL call to hang…

wren93 · January 18, 2024, 5:54pm

Hi, did you manage to solve this? I ran into the same issue. Thanks!

CactusFacts · January 18, 2024, 6:14pm

No luck. As a workaround I ended up moving that FID validation code to a standalone process. While it’s less convenient, this decoupling has the benefit of reducing GPU memory & compute load during your accelerate-based training (assuming you have another GPU you can run this FID calculation on).

emmanuel-oladokun · May 7, 2025, 12:41pm

TL;DR: pass sync_on_compute=False when instantiating the metric class, e.g.
FrechetInceptionDistance(feature=self.cfg.inception_feature, normalize=True, sync_on_compute=False).to(device)

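I had the same issue and managed to solve it by passing sync_on_compute=False to the class when I instantiated it. FID is meant to work in distributed mode, so even if you only call compute() on your main rank (rank 0), the metric still detects that it is in a multi-GPU setting and tries to gather state from all GPUs. Since the other ranks never make the matching call, the collective waits until the NCCL timeout, which is consistent with the _ALLGATHER_BASE timeout in the log above. This seems to be true for all metrics that inherit from the Metric class in torchmetrics.metric, not just FID. Note that with sync_on_compute=False the result only reflects the images passed to update() on that rank.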
