Extremely slow train split generation

I am fine-tuning mBERT on Wikipedia, loaded with the 🤗 Datasets library.

if data_args.dataset_lang:
    raw_datasets = load_dataset(
        'wikipedia',
        language=data_args.dataset_lang,
        date='20240201',
        cache_dir=model_args.cache_dir,
        trust_remote_code=True,
    )
    if "validation" not in raw_datasets.keys():
        raw_datasets["validation"] = load_dataset(
            'wikipedia',
            language=data_args.dataset_lang,
            date='20240201',
            split=f"train[:{data_args.validation_split_percentage}%]",
            cache_dir=model_args.cache_dir,
            trust_remote_code=True,
        )

        raw_datasets["train"] = load_dataset(
            'wikipedia',
            language=data_args.dataset_lang,
            date='20240201',
            split=f"train[{data_args.validation_split_percentage}%:]",
            cache_dir=model_args.cache_dir,
            trust_remote_code=True,
        )

This is the bash script submitted to Slurm (irrelevant lines removed):

#!/bin/bash
#SBATCH -o .../examples/language-modeling/slogs/sl_ka1_%A.out
#SBATCH -e .../examples/language-modeling/slogs/sl_ka1_%A.out
#SBATCH -N 1      # nodes requested
#SBATCH -n 1      # tasks requested
#SBATCH --gres=gpu:8  # request 8 GPUs
#SBATCH --mem=60000  # memory in MB
#SBATCH --partition=PGR-Standard
#SBATCH -t 24:00:00  # time requested in hour:minute:seconds
#SBATCH --cpus-per-task=16  # number of cpus to use - there are 32 on each node

torchrun --nproc_per_node 8 run_mlm.py \
--model_name_or_path bert-base-multilingual-cased \
--cache_dir ${CACHE_HOME}2 \
--dataset_lang ${LANG} \
--dataset_name Wikipedia \
--output_dir ${OUTPUT_DIR} \
--do_train \
--do_eval \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--max_seq_length 256 \
--overwrite_output_dir \
--ft_params_num 7667712 \
--evaluation_strategy steps \
--eval_steps 1000 \
--dataloader_num_workers 16 \
--preprocessing_num_workers 16 \
--validation_split_percentage 5 \
--load_best_model_at_end \
--save_total_limit 2

Here is the speed report; split generation is very slow.

Downloading data: 100%|██████████| 14.1k/14.1k [00:00<00:00, 14.0MB/s]
Downloading data: 100%|██████████| 205M/205M [00:44<00:00, 4.59MB/s]
Generating train split: 0 examples [00:00, ? examples/s]
Extracting content from /.../language-modeling/cache_directory22/downloads/f797c17d35d578a4c1a3f251847095789ec04ae453f10623aeb8366ff4797a07
Generating train split: 170787 examples [17:57, 158.45 examples/s]

Thank you all in advance!

Hi! Only the 20220301 date is preprocessed; for any other date the raw dump has to be downloaded and parsed locally, which is why generation takes so long.
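For reference, the preprocessed 20220301 configs (available for a handful of languages such as English) load without any local extraction. A minimal sketch, assuming the standard "date.lang" config naming of the wikipedia script; the cache path is a placeholder:

from datasets import load_dataset

# Loads the already-processed 20220301 English dump: no raw-XML extraction step.
# "20220301.en" is an example config; only a few languages ship preprocessed.
raw_datasets = load_dataset(
    "wikipedia",
    "20220301.en",
    cache_dir="path/to/cache",   # placeholder
    trust_remote_code=True,      # the wikipedia dataset is a loading script
)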

Still, you can speed up the generation by specifying num_proc= in load_dataset to process the files in parallel.
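Applied to the snippet from the question, that would look roughly like this (a sketch; the language code, worker count, and cache path are placeholders):

from datasets import load_dataset

# num_proc parallelizes the slow "Generating train split" extraction step.
raw_datasets = load_dataset(
    "wikipedia",
    language="ka",              # placeholder: e.g. data_args.dataset_lang
    date="20240201",
    num_proc=16,                # e.g. match --cpus-per-task from the Slurm script
    cache_dir="path/to/cache",  # placeholder
    trust_remote_code=True,
)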

PS: wikimedia/wikipedia hosts the newer Wikipedia dumps, so also check that repo before preprocessing them yourself.
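For instance, something along these lines (a sketch; check the Hub for the exact config names and available dump dates):

from datasets import load_dataset

# wikimedia/wikipedia serves pre-parsed Parquet files, so there is no
# local extraction step and no loading script to trust.
raw_datasets = load_dataset("wikimedia/wikipedia", "20231101.ka")  # "date.lang" config name, assumed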
