Issues running `train_dreambooth.py` in Docker image

:wave: I’m trying to build a Docker image that can run the DreamBooth training example from https://github.com/huggingface/diffusers/tree/main/examples/dreambooth

For some reason the Docker image keeps hanging on the “Generating class images” step. Running the `train_dreambooth.py` script directly on the host machine works fine.

Any idea what could be causing this step to hang (no updates in over 20 minutes)?

Additional context:
I’m running on an Amazon EC2 g4dn instance (16 GB RAM + 16 GB NVIDIA GPU). This is the Dockerfile:

# https://github.com/huggingface/diffusers/blob/main/docker/diffusers-pytorch-cpu/Dockerfile
FROM amazonlinux

ENV DEBIAN_FRONTEND=noninteractive

RUN yum update -y
RUN yum install -y \
    python3 \
    git \
    gcc \
    python3-devel
RUN yum clean all

# make sure to use venv
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN mkdir -p ~/.cache/huggingface
RUN echo -n "hf_abc" > ~/.cache/huggingface/token

# copy repo over
WORKDIR /usr/src/model
COPY . ./

# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
RUN python3 -m pip install --no-cache-dir --upgrade pip
RUN python3 -m pip install --no-cache-dir \
        torch \
        torchvision \
        torchaudio \
        --extra-index-url https://download.pytorch.org/whl/cpu
RUN pip install -qq git+https://github.com/huggingface/diffusers
RUN pip install -q -U --pre triton
RUN pip install -U xformers
RUN pip install --no-cache-dir -r requirements.txt
RUN accelerate config default

The only command I’m running in the container is:

accelerate launch train_dreambooth.py \
        --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
        --instance_data_dir="training_data/cluna"  \
        --class_data_dir="training_data/person" \
        --output_dir="stable_diffusion_weights/cluna" \
        --with_prior_preservation --prior_loss_weight=1.0 \
        --instance_prompt="a photo of cluna person" \
        --class_prompt="a photo of person" \
        --resolution=512 \
        --train_batch_size=1 \
        --gradient_accumulation_steps=1  --gradient_checkpointing \
        --use_8bit_adam \
        --enable_xformers_memory_efficient_attention \
        --set_grads_to_none \
        --learning_rate=2e-6 \
        --lr_scheduler="constant" \
        --lr_warmup_steps=0 \
        --max_train_steps=1200 \
        --num_class_images=8 
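In hindsight, a quick sanity check inside the container would have pointed at the problem. This stdlib-only helper (the `gpu_visible` function is hypothetical, not part of the diffusers example) reports whether `nvidia-smi` can see a GPU from the current environment:

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Best-effort check that the NVIDIA driver is reachable from this environment."""
    if shutil.which("nvidia-smi") is None:
        return False  # driver CLI not present / not mounted into the container
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # lists GPUs, e.g. "GPU 0: Tesla T4 (UUID: ...)"
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print(gpu_visible())
```

If this prints `False` inside the container while printing `True` on the host, the GPU simply isn’t visible to the container and everything (including class-image generation) falls back to the CPU.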

Solved with the following Docker image. The most likely culprit: the original image installed CPU-only PyTorch wheels (`--extra-index-url https://download.pytorch.org/whl/cpu`), so the class images were being generated on the CPU, which is why the step appeared to hang. Switching to a CUDA base image with the cu118 wheels fixed it:

FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

# install CLIs
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    git gcc python3 python3-pip python3-setuptools python3-dev

# setup huggingface
RUN mkdir -p ~/.cache/huggingface
RUN echo -n "hf_abc" > ~/.cache/huggingface/token

# copy repo over
WORKDIR /usr/src/model
COPY . ./

# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
RUN python3 -m pip install --no-cache-dir --upgrade pip
RUN python3 -m pip install --no-cache-dir \
        torch \
        torchvision \
        torchaudio \
        --extra-index-url https://download.pytorch.org/whl/cu118
RUN pip install -qq git+https://github.com/huggingface/diffusers
RUN pip install -q -U --pre triton
RUN pip install -U xformers
RUN pip install --no-cache-dir -r requirements.txt
RUN accelerate config default

Make sure to pass the `--gpus all` flag to `docker run` when using this image; without it, the container can’t see the GPU and you’ll hit the same hang.
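For example (the `dreambooth-train` image tag is a placeholder, and the host needs the NVIDIA Container Toolkit installed):

```shell
# Build the image (tag name is arbitrary)
docker build -t dreambooth-train .

# --gpus all exposes the host GPU(s) to the container;
# without it, torch.cuda.is_available() returns False inside the container
docker run --rm --gpus all dreambooth-train \
    accelerate launch train_dreambooth.py \
        --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
        --instance_data_dir="training_data/cluna" \
        --output_dir="stable_diffusion_weights/cluna" \
        --instance_prompt="a photo of cluna person" \
        --resolution=512 \
        --train_batch_size=1 \
        --max_train_steps=1200
```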