Issues with Dataset Loading and Checkpoint Saving using FSDP with HuggingFace Trainer on SLURM Multi-Node Setup

Hello,

I’m training a model using a SLURM multi-node setup with accelerate, HuggingFace Trainer, FlashAttention2, and FSDP. I’m working on a shared filesystem, which is mounted on all nodes so that every process can access the same data and output directories.

I launch my training script without explicitly importing the accelerate Python package in code. Instead, I run it like this:

accelerate launch --config_file my_config.yaml train_my_model.py

I have two questions:


1. Dataset Loading
In my training script, I load the dataset like this:

from datasets import load_dataset

ds = load_dataset(ds_name, split='train')

Do I need to make sure this is executed only by the main process (e.g., using main_process_first() or a file lock) to prevent conflicts between processes in a distributed setting?
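
For concreteness, here is the pattern I have in mind: a minimal sketch using the main_process_first() context manager on TrainingArguments (the output_dir and other arguments are placeholders, not my real settings):

from datasets import load_dataset
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="/shared/output")  # placeholder; real arguments omitted

# Only the main process loads/prepares the dataset; the other ranks wait at a
# barrier and then read the cached copy from the shared filesystem.
with training_args.main_process_first(local=False, desc="dataset loading"):
    # local=False -> only the global rank-0 process runs the body first, which I
    # believe is what I'd want with a filesystem shared across all nodes.
    ds = load_dataset(ds_name, split="train")  # ds_name as in the snippet above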

Also, given this setup, is it safe to freely use .map() on the dataset inside the training script, or do I need to handle synchronization or process-specific behavior there as well?
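
To make the second part concrete, this is the kind of unsynchronized .map() call I would like to use if it is safe, again just a sketch with a placeholder preprocessing function:

def preprocess(batch):
    # placeholder for my real tokenization / preprocessing
    return batch

# Called as-is on every rank. My understanding is that the datasets fingerprint
# cache on the shared filesystem lets later ranks reuse the result computed by
# the first one, but I would like to confirm that no extra synchronization is needed.
ds = ds.map(
    preprocess,
    batched=True,
    load_from_cache_file=True,
)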


2. FSDP Checkpoint Saving
I’m using a shared filesystem, and I get the following error during the checkpoint-saving step. I suspect it’s related to concurrent writes in a multi-node setup with FSDP.

1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2241, in train
1: [rank9]:     return inner_training_loop(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2612, in _inner_training_loop
1: [rank9]:     self._maybe_log_save_evaluate(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
1: [rank9]:     self._save_checkpoint(model, trial)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3194, in _save_checkpoint
1: [rank9]:     self._save_optimizer_and_scheduler(output_dir)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3311, in _save_optimizer_and_scheduler
1: [rank9]:     save_fsdp_optimizer(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/fsdp_utils.py", line 200, in save_fsdp_optimizer
1: [rank9]:     dist_cp.save_state_dict(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 41, in save_state_dict
1: [rank9]:     return _save_state_dict(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 271, in _save_state_dict
1: [rank9]:     central_plan = distW.reduce_scatter("plan", local_step, global_step)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/utils.py", line 167, in reduce_scatter
1: [rank9]:     all_data = self.gather_object(local_data)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/utils.py", line 106, in gather_object
1: [rank9]:     dist.gather_object(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
1: [rank9]:     return func(*args, **kwargs)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2529, in gather_object
1: [rank9]:     gather(
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
1: [rank9]:     return func(*args, **kwargs)
1: [rank9]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3105, in gather
1: [rank9]:     work = default_pg.gather(output_tensors, input_tensors, opts)
1: [rank9]: RuntimeError: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)

How should I configure fsdp_config properly in this case to avoid such errors during saving?
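
For reference, this is roughly the direction I am experimenting with on the Trainer side. It is only a sketch: the specific values are guesses rather than a known fix, I believe the state-dict type (fsdp_state_dict_type) belongs in the accelerate YAML rather than here, and I am not sure whether fsdp/fsdp_config should be duplicated in TrainingArguments at all when launching through accelerate with a config file:

import os

from transformers import TrainingArguments

# As the error message suggests, surface more NCCL detail (set before the
# process group is initialized).
os.environ.setdefault("NCCL_DEBUG", "INFO")

training_args = TrainingArguments(
    output_dir="/shared/checkpoints/run1",  # placeholder path on the shared filesystem
    save_on_each_node=False,                # shared FS: non-sharded files only need to be written once
    fsdp="full_shard auto_wrap",
    fsdp_config={
        # Keys I believe the Trainer accepts; I am unsure which, if any, affect the NCCL failure.
        "backward_prefetch": "backward_pre",
        "forward_prefetch": False,
        "use_orig_params": True,
        "sync_module_states": True,
    },
)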

Thank you!


This seems to be an unresolved issue, but there may be a way to work around it (such as downgrading the accelerate library).
