Problem launching train_dreambooth_flux.py (noob here)

Heyho guys,
I’m sorry in advance for taking up your valuable time, but I have a problem following the DreamBooth tutorial.

So I followed every step and used ‘accelerate config default’,
but I get the following error message when I run the part where you launch train_dreambooth_flux.py via ‘accelerate launch train_dreambooth_flux.py --XXX’,
with XXX standing in for all the recommended arguments.

Traceback (most recent call last):
  File "/home/tim/miniconda3/envs/flux/bin/accelerate", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train_dreambooth_flux.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-15_15:02:23
  host      : chariot
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 713731)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 713731
=======================================================

Is there any chance someone could point me in the right direction?
I’m using a machine with 2 NVIDIA GeForce RTX 4090s (24 GB each).

I don’t have any problems running other Diffusers models, so it’s probably not a package/dependency problem.

I don’t want a full solution, just maybe a quick tip on where to go from here with this error message?

For example, would it be recommended to enable the logging function and look there for more specific feedback? Not sure I would understand it, though…

Thanks in advance, and sorry for my bad English,
Cheers

2 Likes

Check if the system has resource limits set (e.g., ulimit for processes), which might cause the process to be killed. You can check and increase limits using:

ulimit -a
ulimit -v <value_in_kbytes>  # To increase max virtual memory
  • Ensure that the training script and the accelerate configuration are correctly set up for multi-GPU training. Review the accelerate configuration file (~/.cache/huggingface/accelerate/default_config.yaml) to make sure the GPU allocation is what you expect; see the sketch after this list.
  • Also, review your train_dreambooth_flux.py script for any potential issues, especially related to resource handling or incorrect configurations.
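
For reference, a rough sketch of what I would check (the exact YAML keys can vary between accelerate versions, and the --num_processes override is only meant to rule out a multi-GPU issue, not as a fix):

cat ~/.cache/huggingface/accelerate/default_config.yaml
# On a 2-GPU machine you would expect something roughly like:
#   distributed_type: MULTI_GPU
#   num_processes: 2

# Launch with a single process first to see whether the SIGKILL only happens with two ranks:
accelerate launch --num_processes=1 train_dreambooth_flux.py --XXX
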
1 Like
  • Memory (RAM): Ensure that your system has enough memory to handle the task. Check memory usage using tools like htop or free -h on Linux.
  • GPU Memory: If you’re using multiple GPUs, check that the GPUs have sufficient VRAM for your model. You can use nvidia-smi to check GPU memory usage.
  • Swap Space: If your system is running out of RAM, the OS might kill processes to avoid crashes. You could consider increasing swap space, but note that this can slow down training. A sketch of these checks follows this list.
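
A minimal sketch of these checks (the 32G swap size is only an example, adjust it to your free disk space; dmesg may require root):

free -h                   # system RAM and swap usage
nvidia-smi                # per-GPU VRAM usage
sudo dmesg | grep -iE "killed process|out of memory"   # shows whether the kernel OOM killer ended the run

# Add temporary swap space (example size):
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
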
1 Like