Multi-GPU Issue when trying Diffusers demo

I tried the demo from the "Train a diffusion model" tutorial (huggingface.co) and set notebook_launcher(train_loop, args, num_processes=2).
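
For context, the launch looks roughly like this (train_loop and the objects in args are defined earlier in the tutorial notebook; the only change I made is num_processes=2):

```python
from accelerate import notebook_launcher

# config, model, noise_scheduler, optimizer, train_dataloader and lr_scheduler
# are all created earlier in the tutorial notebook.
args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)

# The tutorial uses num_processes=1; I set it to 2 to use both GPUs.
notebook_launcher(train_loop, args, num_processes=2)
```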

I got the following warning:
UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.

The GPUs are an RTX 3090 and an RTX 3090 Ti.

My accelerate env output is:

  • Accelerate version: 0.31.0
  • Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • accelerate bash location: /home/user/miniconda3/envs/HuggingFace_Learning/bin/accelerate
  • Python version: 3.11.9
  • Numpy version: 1.26.3
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • System RAM: 62.76 GB
  • GPU type: NVIDIA GeForce RTX 3090 Ti
  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: False
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: False
    - tpu_use_cluster: False
    - tpu_use_sudo: False

I tried adding Tensor.contiguous() after the view and transpose operations, but the warning does not go away.
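
What I tried looks roughly like this (a hypothetical sketch; the tensor names are just placeholders for the corresponding operations in the model's forward pass):

```python
# Force a contiguous memory layout after view/transpose so the gradient
# strides hopefully match DDP's bucket view strides.
hidden = hidden.transpose(1, 2).contiguous()        # was: hidden.transpose(1, 2)
hidden = hidden.view(batch_size, -1).contiguous()   # was: hidden.view(batch_size, -1)
```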
