Hi,
I am currently trying to fine-tune the GPT-2 model on the OpenWebText dataset on a single compute node with 8 A100 GPUs, using the run_clm.py script. During the tokenization phase, however, I am encountering a socket timeout error. I would greatly appreciate any insight into this error. Details of the problem are described below. Thanks in advance!
In particular, I am encountering the following error:
Running tokenizer on dataset #4:  40%|████      | 382/952 [30:07<1:20:26, 8.47s/ba]
Running tokenizer on dataset #1:  63%|██████    | 596/952 [30:07<15:14, 2.57s/ba]
Running tokenizer on dataset #6:  41%|████      | 389/952 [30:07<1:21:22, 8.67s/ba]
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
(the two lines above are printed once per worker process, interleaved in the log)
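One detail that may matter: every progress bar stops at almost exactly 30 minutes of elapsed time (30:07), which matches the default timeout torch.distributed uses for collective operations such as the barrier in the traceback. A minimal sketch of what I believe is happening (the init_process_group call at the end is illustrative only, since run_clm.py and the launcher set up the process group themselves):

```python
from datetime import timedelta

# torch.distributed's default process-group timeout is 30 minutes; any rank
# still waiting in default_pg.barrier() past that raises "Socket Timeout".
DEFAULT_PG_TIMEOUT = timedelta(minutes=30)

# Observed in my run: the progress bars die at 30:07 elapsed.
elapsed = timedelta(minutes=30, seconds=7)
print(elapsed > DEFAULT_PG_TIMEOUT)  # -> True

# A larger timeout can be passed when the process group is created
# (illustrative sketch -- run_clm.py does not expose this as a flag):
# torch.distributed.init_process_group(backend="nccl",
#                                      timeout=timedelta(hours=3))
```

If this reading is right, the non-main ranks are simply waiting at the barrier while tokenization of OpenWebText takes longer than 30 minutes.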
The command I am using to run the script, after I get an allocation via Slurm:
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=8 run_clm.py \
--model_name_or_path gpt2 \
--dataset_name openwebtext \
--tokenizer_name gpt2 \
--block_size 1024 \
--preprocessing_num_workers 8 \
--do_train \
--do_eval \
--output_dir ./test-clm-openwebtext-run-8-3
Even if I don't specify block_size and preprocessing_num_workers, I encounter the same error.
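One workaround I am considering (an untested sketch, not something the docs prescribe): run the same script once without the distributed launcher, so a single process tokenizes OpenWebText and writes the datasets cache; the subsequent 8-GPU run should then read the cache instead of waiting at the barrier. I am assuming --max_steps (a standard TrainingArguments flag) is the right way to stop training immediately after preprocessing:

```shell
# Single-process pass to populate the tokenization cache (sketch);
# --max_steps 1 stops training right after preprocessing finishes.
python run_clm.py \
  --model_name_or_path gpt2 \
  --dataset_name openwebtext \
  --tokenizer_name gpt2 \
  --block_size 1024 \
  --preprocessing_num_workers 8 \
  --do_train \
  --max_steps 1 \
  --output_dir ./test-clm-openwebtext-cache
```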
Here is the environment information:
Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: NVIDIA DGX Server (x86_64)
GCC version: (GCC) 10.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-centos-7.8.2003-Core
Is CUDA available: True
CUDA runtime version: 11.0.194
GPU models and configuration:
GPU 0: A100-PCIE-40GB
GPU 1: A100-PCIE-40GB
GPU 2: A100-PCIE-40GB
GPU 3: A100-PCIE-40GB
GPU 4: A100-PCIE-40GB
GPU 5: A100-PCIE-40GB
GPU 6: A100-PCIE-40GB
GPU 7: A100-PCIE-40GB
Nvidia driver version: 450.142.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.1 h6406543_8 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py37h402132d_0 conda-forge
[conda] mkl_fft 1.3.1 py37h3e078e5_1 conda-forge
[conda] mkl_random 1.2.2 py37h219a48f_0 conda-forge
[conda] numpy 1.21.2 py37h20f2e39_0
[conda] numpy-base 1.21.2 py37h79a1101_0
[conda] pytorch 1.8.0 py3.7_cuda11.1_cudnn8.0.5_0 pytorch
[conda] torchaudio 0.8.0 py37 pytorch
[conda] torchvision 0.9.0 py37_cu111 pytorch
I am using transformers version 4.15.0.dev0.
Please let me know if I can provide any additional information. Thanks!