Extremely slow train split generation

I am fine-tuning mBERT on Wikipedia, loaded with the Datasets library.

if data_args.dataset_lang:
    raw_datasets = load_dataset(
        'wikipedia',
        language=data_args.dataset_lang,
        date='20240201',
        cache_dir=model_args.cache_dir,
        trust_remote_code=True,
    )
    if "validation" not in raw_datasets.keys():
        # Carve a validation split out of the head of the train split
        raw_datasets["validation"] = load_dataset(
            'wikipedia',
            language=data_args.dataset_lang,
            date='20240201',
            split=f"train[:{data_args.validation_split_percentage}%]",
            cache_dir=model_args.cache_dir,
            trust_remote_code=True,
        )

        # Keep the remainder as the train split
        raw_datasets["train"] = load_dataset(
            'wikipedia',
            language=data_args.dataset_lang,
            date='20240201',
            split=f"train[{data_args.validation_split_percentage}%:]",
            cache_dir=model_args.cache_dir,
            trust_remote_code=True,
        )

This is the bash script submitted to Slurm (irrelevant lines removed):

#!/bin/bash
#SBATCH -o .../examples/language-modeling/slogs/sl_ka1_%A.out
#SBATCH -e .../examples/language-modeling/slogs/sl_ka1_%A.out
#SBATCH -N 1      # nodes requested
#SBATCH -n 1      # tasks requested
#SBATCH --gres=gpu:8  # request 8 GPUs
#SBATCH --mem=60000  # memory in MB
#SBATCH --partition=PGR-Standard
#SBATCH -t 24:00:00  # time requested in hour:minute:seconds
#SBATCH --cpus-per-task=16  # number of cpus to use - there are 32 on each node

torchrun --nproc_per_node 8 run_mlm.py \
--model_name_or_path bert-base-multilingual-cased \
--cache_dir ${CACHE_HOME}2 \
--dataset_lang ${LANG} \
--dataset_name Wikipedia \
--output_dir ${OUTPUT_DIR} \
--do_train \
--do_eval \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--max_seq_length 256 \
--overwrite_output_dir \
--ft_params_num 7667712 \
--evaluation_strategy steps \
--eval_steps 1000 \
--dataloader_num_workers 16 \
--preprocessing_num_workers 16 \
--validation_split_percentage 5 \
--load_best_model_at_end \
--save_total_limit 2

Here is the speed report; it is very slow.

Downloading data: 100%|██████████| 14.1k/14.1k [00:00<00:00, 14.0MB/s]
Downloading data: 100%|██████████| 205M/205M [00:44<00:00, 4.59MB/s] 
Generating train split: 0 examples [00:00, ? examples/s]Extracting content from /.../language-modeling/cache_directory22/downloads/f797c17d35d578a4c1a3f251847095789ec04ae453f10623aeb8366ff4797a07
Generating train split: 170787 examples [17:57, 158.45 examples/s]

Thank you all in advance!

Hi! Only the 20220301 dump date is preprocessed, so loading any other date has to extract the article content locally, which takes much longer.
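For reference, the preprocessed dump is addressed by a "<date>.<language>" config name rather than the language=/date= arguments. A minimal sketch, assuming one of the languages that actually has a pre-built 20220301 config (check the dataset card for the list):

from datasets import load_dataset

# Pre-built config, so no local article extraction step is needed.
# The "20220301.en" config name is illustrative.
wiki = load_dataset("wikipedia", "20220301.en")
print(wiki["train"][0]["title"])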

Still, you can speed up the generation by specifying num_proc= in load_dataset to process the files in parallel.
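For example, a minimal sketch of that suggestion applied to the call from the question (the language code and the num_proc value are illustrative; set num_proc to the number of CPUs you actually have):

from datasets import load_dataset

# Generate the split with several worker processes instead of one.
wiki = load_dataset(
    "wikipedia",
    language="ka",
    date="20240201",
    num_proc=16,
    trust_remote_code=True,
)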

PS: wikimedia/wikipedia hosts the newer Wikipedia dumps, so also check that repo before preprocessing them yourself.
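A sketch of that alternative, assuming a config name of the form "<dump date>.<language code>" (check the wikimedia/wikipedia dataset card for the dates and codes that are actually available):

from datasets import load_dataset

# Pre-processed dump hosted on the Hub; no dataset script, so no
# trust_remote_code and no local extraction. The config name is illustrative.
wiki = load_dataset("wikimedia/wikipedia", "20231101.ka")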