GPT-2 fine-tuning with OpenWebText: socket timeout

Hi,

I am currently trying to fine-tune GPT-2 on the OpenWebText dataset on a single compute node with 8 A100 GPUs, using the run_clm.py script to submit the job. However, during the tokenization phase, I am encountering a socket timeout error. I would greatly appreciate any insight into this error; the details of the problem are described below. Thanks in advance!

In particular, I am encountering the following error:

Running tokenizer on dataset #4:  40%|████      | 382/952 [30:07<1:20:26,  8.47s/ba]
Running tokenizer on dataset #1:  63%|██████▎   | 596/952 [30:07<15:14,  2.57s/ba]
Running tokenizer on dataset #6:  41%|████      | 389/952 [30:07<1:21:22,  8.67s/ba]

    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout

(the barrier line and "RuntimeError: Socket Timeout" repeat, interleaved, once for each of the eight worker processes)
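One thing I noticed: the progress bars show roughly 30:07 elapsed at the moment of the crash, which lines up with torch.distributed's default process-group timeout of 30 minutes. My guess is that the non-main ranks are waiting at the barrier while the main process tokenizes the dataset, and the barrier times out before tokenization finishes. Below is a minimal sketch of the workaround I have in mind, assuming the timeout really is the culprit; the 3-hour value is an arbitrary guess on my part, and since run_clm.py creates the process group inside the Trainer/TrainingArguments machinery, the timeout would presumably have to be patched in there rather than called directly like this:

import datetime
import torch.distributed as dist

# Create the process group with a timeout longer than the default 30 minutes,
# so ranks waiting at the barrier survive a long tokenization pass. (As far
# as I understand, with the NCCL backend on torch 1.8 this timeout is only
# enforced when NCCL_BLOCKING_WAIT=1 is set in the environment.)
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=3))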

The command I am using to run the script, after I get an allocation via Slurm:

python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=8 run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name openwebtext \
    --tokenizer_name gpt2 \
    --block_size 1024 \
    --preprocessing_num_workers 8 \
    --do_train \
    --do_eval \
    --output_dir ./test-clm-openwebtext-run-8-3

Even if I don't specify --block_size or --preprocessing_num_workers, I encounter the same error.
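A workaround I am considering, though I have not verified it, is to tokenize the dataset once in a single non-distributed process so that the datasets Arrow cache gets populated, and only then launch the eight-GPU run, which should load the cached tokenized data and clear the barrier quickly. A rough sketch follows; the tokenize_function below is my approximation of what run_clm.py does, so the cache fingerprint may not match the distributed run exactly:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
raw_dataset = load_dataset("openwebtext", split="train")

def tokenize_function(examples):
    # Mirrors run_clm.py's tokenization step (my approximation).
    return tokenizer(examples["text"])

# Populate the cache; a later run with the same function and arguments
# should reuse the cached Arrow files instead of re-tokenizing.
raw_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=8,
    remove_columns=["text"],
)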

Here is the environment information:

Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: NVIDIA DGX Server (x86_64)
GCC version: (GCC) 10.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.11 (default, Jul 27 2021, 14:32:16)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-centos-7.8.2003-Core
Is CUDA available: True
CUDA runtime version: 11.0.194
GPU models and configuration:
GPU 0: A100-PCIE-40GB
GPU 1: A100-PCIE-40GB
GPU 2: A100-PCIE-40GB
GPU 3: A100-PCIE-40GB
GPU 4: A100-PCIE-40GB
GPU 5: A100-PCIE-40GB
GPU 6: A100-PCIE-40GB
GPU 7: A100-PCIE-40GB

Nvidia driver version: 450.142.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.1.1               h6406543_8    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py37h402132d_0    conda-forge
[conda] mkl_fft                   1.3.1            py37h3e078e5_1    conda-forge
[conda] mkl_random                1.2.2            py37h219a48f_0    conda-forge
[conda] numpy                     1.21.2           py37h20f2e39_0
[conda] numpy-base                1.21.2           py37h79a1101_0
[conda] pytorch                   1.8.0           py3.7_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torchaudio                0.8.0                      py37    pytorch
[conda] torchvision               0.9.0                py37_cu111    pytorch

I am using transformers version 4.15.0.dev0.

Please let me know if I can provide any additional information. Thanks!