Accelerate + Multi-GPU + Automatic1111 + Dreambooth Extension

I’m currently trying to use accelerate to run Dreambooth via Automatic1111’s webui on 4x RTX 3090s.

Here’s my setup and what I’ve done so far, including the issues I’ve encountered and how I solved them:

OS: Ubuntu Mate 22.04
Environment Setup:
Using miniconda, created an environment named sd-dreambooth.

Cloned Auto1111’s repo, navigated to extensions, and cloned the Dreambooth extension.

Running it with accelerate without modifying ./webui.sh causes multiple instances of the webui to be launched. I needed to add:
--num_processes 1 to the accelerate launch args towards the end of the script.

For some reason, cudatoolkit didn’t get installed by the script, so I was getting an error related to:
"str2optimizer32bit"
(as far as I can tell, this comes from bitsandbytes, the library behind 8-bit Adam, falling back to its CPU-only build when it can’t find the CUDA runtime). Fixed by running:
conda install cudatoolkit
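
A quick, webui-independent way to sanity-check that the runtime is visible again (assuming a bitsandbytes build from that era, which printed its CUDA setup log on import):

import torch
import bitsandbytes  # backs 8-bit Adam; warnings printed here usually mean libcudart wasn't found
print(torch.version.cuda)         # CUDA version this torch build targets (cu116 -> 11.6)
print(torch.cuda.is_available())  # should be True on a working setup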

I noticed during launch that I was getting an error saying that triton wasn’t installed.
Fixed with:
pip install triton

I usually use fp16, but after installing triton I started getting an error related to:
"slow_conv2d_cpu" not implemented for 'Half'
which, after some research, led me to believe that I just had to use no mixed precision, so I added
--mixed_precision no
to the accelerate launch args as well. That solved that problem.
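
For reference, that error just means a half-precision tensor ended up in a convolution that ran on the CPU instead of a GPU; a minimal repro, independent of the webui, is:

import torch
# fp16 tensors left on the CPU; on this Torch build (1.12.1) CPU convs don't support Half
x = torch.randn(1, 3, 8, 8, dtype=torch.float16)
w = torch.randn(4, 3, 3, 3, dtype=torch.float16)
torch.nn.functional.conv2d(x, w)  # RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

So --mixed_precision no hides the symptom, but the underlying issue is that part of the work is being placed on the CPU, which also lines up with the comment further down about the command mixing up input/compute types.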

So currently, my accelerate launch is:
accelerate launch --multi_gpu --gpu_ids 0,1,2,3 --mixed_precision no --num_machines 1 --num_processes 1 --num_cpu_threads_per_process=1
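
One thing to note about this command: --num_processes 1 means accelerate only ever spawns a single worker, so --multi_gpu and --gpu_ids 0,1,2,3 have nothing to spread the work across. A launch that actually starts one process per GPU would look more like:

accelerate launch --multi_gpu --gpu_ids 0,1,2,3 --mixed_precision no --num_machines 1 --num_processes 4 --num_cpu_threads_per_process=1

although, as the replies further down show, doing that with the webui just starts four copies of the process, all on GPU 0, rather than one distributed training run.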

So, at this point I get no errors by using the following advanced settings in Dreambooth:

8 Bit Adam = Yes
Mixed Precision = No
Memory Attention = Default
Don’t Cache Latents = False
Train Text Encoder = True
Train EMA = True
Shuffle After Epoch = False
Pad Tokens = True
Gradient Checkpointing = False

The Web UI launches cleanly without errors (the complaint about line 129 is just because I’m using a conda env):

Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on lukium user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################
./webui.sh: line 129: source: -/: invalid option
source: usage: source filename [arguments]

################################################################
Accelerating launch.py...
################################################################
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
Commit hash: 828438b4a190759807f9054932cae3a8b880ddf1
Installing requirements for Web UI
Installing requirements for Dreambooth
Checking Dreambooth requirements.
Dreambooth revision is c589a3596ade64228de8a7851f50c2470c7a76aa
Args: ['extensions/sd_dreambooth_extension/install.py']
[*] Diffusers version is 0.7.2.
[*] Torch version is 1.12.1+cu116.
[*] Torch vision version is 0.13.1+cu116.
[*] Transformers version is 4.21.0.
[*] Xformers


Launching Web UI with arguments: --ckpt-dir ./checkpoints --disable-safe-unpickle --xformers
Patching transformers to fix kwargs errors.
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loading weights [81761151] from /home/lukium/stable-diffusion/instances/sd-dreambooth/models/Stable-diffusion/sd-15/sd-v1-5.ckpt
Global Step: 840000
Using VAE found similar to selected model: /home/lukium/stable-diffusion/instances/sd-dreambooth/models/Stable-diffusion/sd-15/sd-v1-5.vae.pt
Loading VAE weights from: /home/lukium/stable-diffusion/instances/sd-dreambooth/models/Stable-diffusion/sd-15/sd-v1-5.vae.pt
Applying xformers cross attention optimization.
Model loaded.
Loaded a total of 0 textual inversion embeddings.
Embeddings: 
Running on local URL:  http://127.0.0.1:7860

Everything seems good and training works, but only one GPU (GPU 0) still gets used.

nvidia-smi shows that everything is good to go:

Sat Nov 26 12:00:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 57%   41C    P8    38W / 370W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   56C    P8    30W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   52C    P8    22W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   52C    P8    24W / 420W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1706      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1706      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1706      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1706      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Any suggestions to get all 4 GPUs to work?

Did you try with --num_processes 4 (i.e. one for each GPU)?

I did try that, but the result was the process running 4 times, all of them using GPU 0.
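
For anyone debugging the same thing, a tiny standalone script (call it check_gpus.py, a hypothetical name; it’s plain Accelerate, nothing from the webui) run with accelerate launch --num_processes 4 check_gpus.py shows what each spawned process actually sees:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# with a working multi-GPU launch each process reports a different index and device (cuda:0 ... cuda:3)
print(f"process {accelerator.process_index}/{accelerator.num_processes} "
      f"on {accelerator.device}, visible GPUs: {torch.cuda.device_count()}")

If every process reports cuda:0, the launcher isn’t assigning them to separate devices.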

For this part you can most likely create a basic Accelerator and run it under if accelerator.is_main_process (sketched below). Currently your accelerate launch is just running “multi gpu” on a single GPU (so not multi-GPU), and the command being run is mixing up input/compute types (which we should probably guard against!).

To learn more about what I mean, check out this doc tutorial: Deferring Executions
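
For what it’s worth, the pattern being suggested looks roughly like this (a minimal sketch using the standard Accelerate API, not webui/Dreambooth code):

from accelerate import Accelerator

accelerator = Accelerator()

# ... work that every process should run (e.g. the training loop) goes here ...

if accelerator.is_main_process:
    # things that should only happen once, e.g. saving checkpoints or writing logs
    print(f"main process out of {accelerator.num_processes} total")

accelerator.wait_for_everyone()  # keep the remaining processes in sync afterwards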

This issue is similar to my issue - Notebook_launcher set num_processes=2 but it say Launching training on one GPU. in Kaggle

Only one GPU is used for training.

The accelerate flag (--num_cpu_threads_per_process) only assigns 6 threads per Python process, and nothing more.
As far as I understand from the Python manuals, this has no effect on any GPU-related tasks (unless the webui has additional code for this, but to my knowledge, the A1111 web UI is written to use only one GPU).