Dataset.transform() hangs indefinitely while fine-tuning Stable Diffusion XL

I am following the fine-tuning script described at https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md

With multi-GPU training, the script never reaches the HERE4 print statement in the snippet below (the print calls are my own debug markers):

with accelerator.main_process_first():
    print(accelerator.is_main_process)
    print("===========Here3.1===========")
    if args.max_train_samples is not None:
        dataset["train"] = dataset["train"].shuffle(seed=args.seed).select(range(args.max_train_samples))
    print("===========Here3.2===========")
    # Set the training transforms
    train_dataset = dataset["train"].with_transform(preprocess_train)
print("==========HERE4=============")

Corresponding output:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
10/25/2023 21:18:04 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 3
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

10/25/2023 21:18:04 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 3
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

10/25/2023 21:18:04 - INFO - main - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 3
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'clip_sample_range', 'thresholding', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
==========HERE1=============
==========HERE1=============
==========HERE1=============
==========HERE2=============
==========HERE2=============
==========HERE2=============
==========HERE3=============
True
===========Here3.1===========
===========Here3.2===========
==========HERE3=============
==========HERE3=============
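
To make the ordering in this log easier to read, here is a minimal sketch of how I understand main_process_first() to sequence the processes. It is a simulation only: threads stand in for the three GPU processes, and a threading.Barrier stands in for the real synchronization call (accelerator.wait_for_everyone()). The helper names simulate and worker are hypothetical, and the entry/exit barrier placement is my assumption about the semantics, not code copied from accelerate.

```python
import threading
from contextlib import contextmanager


def simulate(num_processes=3):
    """Simulate three ranks passing through a main_process_first() block."""
    # Stand-in for the distributed barrier (assumption: one shared barrier
    # that every rank hits exactly once per main_process_first() block).
    barrier = threading.Barrier(num_processes)
    events = []
    lock = threading.Lock()

    def log(message):
        with lock:
            events.append(message)

    @contextmanager
    def main_process_first(is_main):
        # Assumed semantics: non-main ranks wait on entry,
        # the main rank waits on exit, so the main rank runs the body first.
        if not is_main:
            barrier.wait(timeout=10)
        yield
        if is_main:
            barrier.wait(timeout=10)

    def worker(rank):
        is_main = rank == 0
        log(f"rank {rank}: HERE3")
        with main_process_first(is_main):
            log(f"rank {rank}: Here3.1")
            log(f"rank {rank}: Here3.2")
        log(f"rank {rank}: HERE4")

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(num_processes)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    for line in simulate():
        print(line)
```

In this simulation every rank eventually reaches HERE4 because the barrier releases. In my run, only the main process prints Here3.1 and Here3.2 and nothing follows, which would be consistent with the real barrier never completing across the three processes. That is only one possible reading of the log, not a confirmed diagnosis.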

It works fine in a single-GPU setting but, sadly, errors out there because the 24 GB of memory on my NVIDIA A5000 is not enough.

I am using CUDA 11.7, PyTorch 1.13.1, and diffusers 0.22.0.dev0.

Any input would be appreciated.