Clm repeats tokenization when distributed

calderma · February 22, 2022, 7:07pm

Hello, when using the example script run_clm.py in examples/pytorch/language-modeling/ with distributed training, it seems to repeat the tokenization for each GPU. The messages "Running tokenizer on dataset" and "Grouping texts in chunks of {block_size}" repeat over and over. This does not happen when running on a single GPU. The tokenization also takes significantly longer. My script call is:
deepspeed run_clm.py \
  --model_name_or_path EleutherAI/gpt-neo-125M \
  --train_file $TRAIN_FILE \
  --validation_file $VAL_FILE \
  --block_size $BLOCK_SIZE \
  --overwrite_output_dir \
  --fp16 \
  --per_device_train_batch_size 3 \
  --per_device_eval_batch_size 3 \
  --do_train \
  --do_eval \
  --group_by_length \
  --gradient_accumulation_steps 4 \
  --deepspeed "deepspeed_zero2_config.json" \
  --output_dir $OUTPUT_DIR
This is a huge issue when using a large body of text. Any help would be appreciated.

calderma · March 9, 2022, 8:45pm

Hello, sorry to bump, but I was wondering if anyone had any information about this? On a large dataset it makes tokenization go from a few hours on one GPU to several days on multiple GPUs. Alternatively, could I tokenize on one GPU and then load the result from the cache manually? Arrow does not seem to recognize that it is the same dataset when looking at the cache. Thank you for any help.

sgugger · March 9, 2022, 9:01pm

The tokenization is done only on the main process and then cached for the others, thanks to the context manager. Only when you run multi-node training will every node do the tokenization, in which case you should preprocess your dataset once and for all.

calderma · March 10, 2022, 9:37pm

Thank you for your response. I believe this is just a result of how my grid is set up, then. I'm wondering why each tokenization takes so much longer in distributed training, though. When I run non-distributed, the tokenization takes about 2 hours for the dataset; in distributed training, each instance of tokenization takes 12-15 hours, and it does this once for each node. I tried letting one round of tokenization finish in distributed training and then restarting the program to see if it would use the cached dataset. The first process loaded the cached processed dataset, but then the other nodes started doing their own tokenization again. How can I get the other processes to recognize that they should use that same cached data? For context, my grid uses SLURM to allocate resources, so I will usually end up getting different nodes every time I train. Thank you so much for your help.

songyf1994 · July 15, 2022, 12:18am

Hi @sgugger, I'm a beginner, but I was wondering if this line (transformers/examples/pytorch/language-modeling/run_clm_no_trainer.py at main · huggingface/transformers · GitHub) should be something like if is_local_main_process():, because I guess main_process_first means other processes would still enter this code block and redo the tokenization.
Please forgive me if the question is too dumb.

sgugger · July 15, 2022, 2:50pm

No, they will enter the context after the main process, and since everything Datasets does is cached, it will use the cache and not redo the tokenization.
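
The ordering described above can be sketched with a toy model. This is not the Transformers or Accelerate implementation: threads stand in for processes, a dictionary stands in for the on-disk cache, and every name here (`main_process_first`, `cached_tokenize`, `NUM_PROCESSES`) is hypothetical. It only shows why "enter after the main process" plus a cache means the work is done once.

```python
import threading
from contextlib import contextmanager

NUM_PROCESSES = 4  # hypothetical world size
barrier = threading.Barrier(NUM_PROCESSES)
cache = {}            # stands in for the on-disk dataset cache
cache_lock = threading.Lock()
tokenize_calls = []   # records which rank actually did the tokenization


@contextmanager
def main_process_first(rank):
    # Non-main ranks block here until rank 0 has finished the body.
    if rank != 0:
        barrier.wait()
    try:
        yield
    finally:
        # Rank 0 reaches the barrier only after its body ran, releasing the others.
        if rank == 0:
            barrier.wait()


def cached_tokenize(rank, text):
    # Do the real work only on a cache miss; otherwise reuse the stored result.
    with cache_lock:
        if text not in cache:
            tokenize_calls.append(rank)
            cache[text] = text.split()
        return cache[text]


def worker(rank, results):
    # Every rank enters the block, but only the first one misses the cache.
    with main_process_first(rank):
        results[rank] = cached_tokenize(rank, "hello distributed world")


results = {}
threads = [threading.Thread(target=worker, args=(r, results)) for r in range(NUM_PROCESSES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch all four ranks run the body, yet `tokenize_calls` contains only rank 0. The same reasoning breaks down when the ranks do not share a cache, which matches the multi-node case discussed earlier in the thread.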
