Multi-GPU Issue when trying Diffusers demo

I tried the demo from the "Train a diffusion model" tutorial (huggingface.co) and set notebook_launcher(train_loop, args, num_processes=2).
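
For context, the launch looks roughly like this (train_loop and the objects in args are defined earlier in the tutorial notebook; the only change I made is num_processes=2):

```python
from accelerate import notebook_launcher

# config, model, noise_scheduler, optimizer, train_dataloader and lr_scheduler
# are all created earlier in the tutorial notebook.
args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)

# The tutorial uses num_processes=1; I set it to 2 to use both GPUs.
notebook_launcher(train_loop, args, num_processes=2)
```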

I got the following warning:
UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.

The GPUs are an RTX 3090 and an RTX 3090 Ti.

My accelerate env output is:

  • Accelerate version: 0.31.0
  • Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • accelerate bash location: /home/user/miniconda3/envs/HuggingFace_Learning/bin/accelerate
  • Python version: 3.11.9
  • Numpy version: 1.26.3
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • System RAM: 62.76 GB
  • GPU type: NVIDIA GeForce RTX 3090 Ti
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: False
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: False
    - tpu_use_cluster: False
    - tpu_use_sudo: False

I tried adding Tensor.contiguous() after the view and transpose operations, but the warning does not go away.
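
What I tried looks roughly like this (a hypothetical sketch; the tensor names are just placeholders for the corresponding operations in the model's forward pass):

```python
# Force a contiguous memory layout after view/transpose so the gradient
# strides hopefully match DDP's bucket view strides.
hidden = hidden.transpose(1, 2).contiguous()        # was: hidden.transpose(1, 2)
hidden = hidden.view(batch_size, -1).contiguous()   # was: hidden.view(batch_size, -1)
```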
