run_mlm.py: generating the train split is very slow

I am running run_mlm.py on openwebtext, which is about 10 GB. I am using 4 nodes on a Slurm cluster, each with 4 A100 GPUs, so there is plenty of compute power; however, generating the train split takes about 100 hours for only 10 GB. How can this be sped up?

DISTRIBUTED_ARGS="--nproc_per_node $NPROC_PER_NODE \
                  --nnodes $SLURM_JOB_NUM_NODES \
                  --node_rank $SLURM_NODEID \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

cmd1="torchrun $DISTRIBUTED_ARGS \
    transformers/examples/pytorch/language-modeling/run_mlm.py \
    --model_name_or_path "microsoft/deberta-v3-base" \
    --dataset_name "openwebtext" \
    --num_train_epochs 3.0 \
    --dataloader_num_workers 2 \
    --fp16 \
    --per_device_train_batch_size 10 \
    --per_device_eval_batch_size 10 \
    --do_train \
    --do_eval \
    --evaluation_strategy "steps" \
    --eval_steps 2000 \
    --report_to "wandb" \
    --overwrite_output_dir"

$cmd1
Generating train split:   1%|▏         | 101641/8013769 [1:08:17<99:51:59, 22.01 examples/s]
Generating train split:   1%|▏         | 101674/8013769 [1:08:19<101:24:22, 21.05 examples/s]
Generating train split:   1%|▏         | 101703/8013769 [1:08:20<104:04:45, 21.12 examples/s]

I'm on the latest PyTorch and Transformers with CUDA 11.6, but I guess this step is CPU/RAM heavy. Is there another way of generating the train split, e.g. before running the MLM script?

Thanks!

Hi @timpal0l,

The dataloader worker count is set to only 2, which seems to be why the pipeline is slow.

I think you can increase `--dataloader_num_workers` to speed up the train split.
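For the split-generation step specifically, another option is to build and cache the dataset once before submitting the Slurm job, so every `torchrun` worker later just loads the Arrow cache instead of regenerating it. A minimal sketch, assuming a `datasets` version recent enough that `load_dataset` accepts `num_proc` (2.5 or later); the worker cap of 16 is an illustrative choice, not something from your script:

```python
import os

def pick_num_proc(max_workers: int = 16) -> int:
    """Use most of the node's cores for generation, capped so a shared
    Slurm node is not oversubscribed."""
    return max(1, min(os.cpu_count() or 1, max_workers))

def pregenerate() -> None:
    """Download openwebtext and materialize the train split once.

    Call this once on a login/CPU node before launching run_mlm.py;
    subsequent runs reuse the Arrow files in the datasets cache.
    """
    # Imported lazily so pick_num_proc stays usable without `datasets`.
    from datasets import load_dataset
    # num_proc shards split generation across processes (assumed
    # available in datasets >= 2.5).
    load_dataset("openwebtext", num_proc=pick_num_proc())
```

If the cluster nodes share a filesystem, point `HF_DATASETS_CACHE` at a shared path so all four nodes hit the same cache; raising `--dataloader_num_workers` then helps the training loop itself rather than the one-time generation.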

Regards.