I am running run_mlm.py on openwebtext, which is about 10 GB. I am using 4 nodes on a Slurm cluster, each with 4 A100 GPUs, so there is plenty of compute; however, generating the train split alone takes about 100 hours for only 10 GB. How can this be sped up?
DISTRIBUTED_ARGS="--nproc_per_node $NPROC_PER_NODE \
--nnodes $SLURM_JOB_NUM_NODES \
--node_rank $SLURM_NODEID \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT"
cmd1="torchrun $DISTRIBUTED_ARGS \
transformers/examples/pytorch/language-modeling/run_mlm.py \
--model_name_or_path microsoft/deberta-v3-base \
--dataset_name openwebtext \
--num_train_epochs 3.0 \
--dataloader_num_workers 2 \
--fp16 \
--per_device_train_batch_size 10 \
--per_device_eval_batch_size 10 \
--do_train \
--do_eval \
--evaluation_strategy steps \
--eval_steps 2000 \
--report_to wandb \
--overwrite_output_dir"
$cmd1
Generating train split: 1%|▏ | 101641/8013769 [1:08:17<99:51:59, 22.01 examples/s]
Generating train split: 1%|▏ | 101674/8013769 [1:08:19<101:24:22, 21.05 examples/s]
Generating train split: 1%|▏ | 101703/8013769 [1:08:20<104:04:45, 21.12 examples/s]
Latest PyTorch and transformers with CUDA 11.6. I guess this step is CPU/RAM bound rather than GPU bound. Is there another way of generating the train split, e.g. before running the MLM script?
Thanks!