DDP program hangs in trainer.predict() and trainer.evaluate()

Hi,

Brief description of the issue:
I am trying to fine-tune the “RWKV/rwkv-4-169m-pile” model with PEFT on my own dataset and then test it, using the Trainer API for both training and testing. With DDP, after training finishes, calling trainer.predict(test_dataset) on rank 0 hangs at that line, and after a while a GPU timeout message appears. The hang occurs while computing the first test batch.

After some digging, I found that the hang happens at the line
labels = self.accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
in trainer.predict(). Digging deeper, the hang is inside the _gpu_gather() function in operations.py of the accelerate library.

The dataset should not be the problem: training works fine, and the hang remains even when I pass the training dataset to trainer.predict() as the test set.

I would appreciate it if anyone can help with this.

Relevant code snippet:

import torch.distributed as dist

# Join the NCCL process group and look up this process's rank
dist.init_process_group("nccl")
rank = dist.get_rank()
.......
trainer.train()
if rank == 0:
    trainer.predict(test_dataset)
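
One thing worth checking in the snippet above, offered as a likely explanation rather than a confirmed diagnosis: under DDP, trainer.predict() runs cross-process operations (the pad_across_processes call gathers from every rank), so if only rank 0 calls it, rank 0 waits for the other ranks, which never arrive. A minimal sketch of the usual pattern, where every rank calls predict() and only rank 0 keeps the result. The helper name predict_on_all_ranks is hypothetical and not part of transformers:

```python
def predict_on_all_ranks(trainer, test_dataset, rank):
    """Hypothetical helper: run predict() on every rank, return output on rank 0 only.

    Assumes `trainer` exposes predict(dataset), as transformers.Trainer does,
    and that `rank` comes from dist.get_rank() as in the snippet above.
    """
    # Every rank must call predict() so that all of them enter the same collectives.
    output = trainer.predict(test_dataset)
    # Only rank 0 keeps the result, for logging, saving, or metric reporting.
    return output if rank == 0 else None
```

In the snippet above this would replace the `if rank == 0:` guard around trainer.predict(test_dataset); the guard then moves to whatever is done with the returned predictions.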

Error:
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=81130, OpType=ALLREDUCE, NumelIn=6145, NumelOut=6145, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
…
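
The Timeout(ms)=600000 in the log is a 10-minute limit on a collective operation. To illustrate why a collective can time out when the ranks do not all reach the same call, here is a small simulation using only the Python standard library. It is an analogy, not NCCL: threads stand in for ranks and a threading.Barrier stands in for the collective, and the function name run_ranks is made up for this sketch.

```python
import threading


def run_ranks(world_size, ranks_that_call, timeout=0.2):
    """Simulate a collective: it completes only if every rank joins it."""
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        if rank not in ranks_that_call:
            # This rank never enters the collective.
            results[rank] = "skipped"
            return
        try:
            # Blocks until all `world_size` participants arrive, or times out.
            barrier.wait(timeout=timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            results[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Only rank 0 joins: it waits for the others and then times out.
print(run_ranks(4, {0}))
# All ranks join: the collective completes on every rank.
print(run_ranks(4, {0, 1, 2, 3}))
```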

Package versions:
python 3.10
torch 2.2.0
transformers 4.37.1
accelerate 0.26.1

NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2

What kind of GPU are you using?

I am using 8 A100 80GB GPUs.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0              95W / 400W |  79409MiB / 81920MiB |     35%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   25C    P0              87W / 400W |  79553MiB / 81920MiB |     48%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:44:00.0 Off |                    0 |
| N/A   25C    P0              90W / 400W |  79553MiB / 81920MiB |     46%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:4A:00.0 Off |                    0 |
| N/A   32C    P0             279W / 400W |  79553MiB / 81920MiB |     36%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:84:00.0 Off |                    0 |
| N/A   33C    P0              93W / 400W |  79553MiB / 81920MiB |     48%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:8A:00.0 Off |                    0 |
| N/A   28C    P0              89W / 400W |  79553MiB / 81920MiB |     47%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:C0:00.0 Off |                    0 |
| N/A   26C    P0             266W / 400W |  79553MiB / 81920MiB |     39%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:C3:00.0 Off |                    0 |
| N/A   32C    P0             111W / 400W |  79409MiB / 81920MiB |     47%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+