Problem launching train_dreambooth_flux.py (noob here)

Heyho guys,
I’m sorry in advance for taking up your valuable time, but I have a problem following the DreamBooth tutorial.

So I followed every step and used ‘accelerate config default’,
but I get the following error message when I run the part where you launch train_dreambooth_flux.py via ‘accelerate launch train_dreambooth_flux.py --XXX’,
with XXX standing in for all the recommended arguments.

Traceback (most recent call last):
  File "/home/tim/miniconda3/envs/flux/bin/accelerate", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tim/miniconda3/envs/flux/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train_dreambooth_flux.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-15_15:02:23
  host      : chariot
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 713731)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 713731
=======================================================

Is there any chance someone could point me in the right direction?
I’m using a machine with 2 NVIDIA GeForce RTX 4090s (24 GB each).

I don’t have any problems running other Diffusers models, so it’s probably not a package/dependency problem.

I don’t want a full solution, just maybe a quick tip on where to go from here with this error message?

For example, would it be recommended to enable the logging function and look there for more specific feedback? Not sure I would understand it, though…

Thanks in advance, and sorry for my bad English,
Cheers

2 Likes

Check if the system has resource limits set (e.g., ulimit for processes), which might cause the process to be killed. You can check and increase limits using:

ulimit -a
ulimit -v <value_in_kbytes>  # To increase max virtual memory
  • Ensure that the training script and the accelerate configuration are correctly set up for multi-GPU training. Review the accelerate configuration file (~/.cache/huggingface/accelerate/default_config.yaml) to make sure the GPU allocation is what you expect; see the sketch after this list.
  • Also, review your train_dreambooth_flux.py script for any potential issues, especially related to resource handling or incorrect configurations.
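
For reference, a rough sketch of what I would check (the exact YAML keys can vary between accelerate versions, and the --num_processes override is only meant to rule out a multi-GPU issue, not as a fix):

cat ~/.cache/huggingface/accelerate/default_config.yaml
# On a 2-GPU machine you would expect something roughly like:
#   distributed_type: MULTI_GPU
#   num_processes: 2

# Launch with a single process first to see whether the SIGKILL only happens with two ranks:
accelerate launch --num_processes=1 train_dreambooth_flux.py --XXX
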
1 Like
  • Memory (RAM): Ensure that your system has enough memory to handle the task. Check memory usage using tools like htop or free -h on Linux.
  • GPU Memory: If you’re using multiple GPUs, check that the GPUs have sufficient VRAM for your model. You can use nvidia-smi to check GPU memory usage.
  • Swap Space: If your system is running out of RAM, the OS might kill processes to avoid crashes. You could consider increasing swap space, but note that this can slow down training. A sketch of these checks follows this list.
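
A minimal sketch of these checks (the 32G swap size is only an example, adjust it to your free disk space; dmesg may require root):

free -h                   # system RAM and swap usage
nvidia-smi                # per-GPU VRAM usage
sudo dmesg | grep -iE "killed process|out of memory"   # shows whether the kernel OOM killer ended the run

# Add temporary swap space (example size):
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
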
1 Like