Clm repeats tokenization when distributed

Hello, when using the example script in examples/pytorch/language-modeling/ with distributed training, it seems to repeat the tokenization for each GPU: the messages “Running tokenizer on dataset” and “Grouping texts in chunks of {block_size}” appear over and over. This does not happen when running on a single GPU, and the tokenization also takes significantly longer. My script call is:
--model_name_or_path EleutherAI/gpt-neo-125M
--train_file $TRAIN_FILE
--validation_file $VAL_FILE
--block_size $BLOCK_SIZE
--per_device_train_batch_size 3
--per_device_eval_batch_size 3
--gradient_accumulation_steps 4
--deepspeed "deepspeed_zero2_config.json"
--output_dir $OUTPUT_DIR
This is a huge issue when using a large body of text. Any help would be appreciated.


Hello, sorry to bump but I was wondering if anyone had any information about this? On a large dataset it makes tokenization go from a few hours on one GPU to several days on multiple. Or alternatively could I tokenize on one GPU and then load it from cache manually? Arrow does not seem to recognize that it is the same dataset when looking at the cache. Thank you for any help.

The tokenization is only done on the main process and then cached for the others, thanks to the context manager. It is only if you run a multinode training that every node will do the tokenization, in which case you should preprocess your dataset once and for all.
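If you do preprocess once and for all, a common approach is to tokenize in a separate job, write the result to a path on a shared filesystem, and have every training job load that artifact (with 🤗 Datasets you would use Dataset.save_to_disk and load_from_disk for this). Here is a minimal stdlib-only sketch of the compute-once / load-everywhere pattern; the toy character-code "tokenizer" and the function name preprocess_once are illustrative, not real library APIs:

```python
import json
import os


def preprocess_once(raw_texts, out_path):
    # If a previous job already wrote the tokenized dataset to the
    # shared path, just load it instead of recomputing.
    if os.path.exists(out_path):
        with open(out_path) as f:
            return json.load(f), "loaded"
    # Otherwise compute it once and persist it for every other node.
    tokenized = [[ord(c) for c in t] for t in raw_texts]  # toy "tokenizer"
    with open(out_path, "w") as f:
        json.dump(tokenized, f)
    return tokenized, "computed"
```

On a SLURM grid this sidesteps cache-fingerprint mismatches entirely: every node reads the same explicit artifact instead of relying on each node's local Arrow cache.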

Thank you for your response. I believe this is just a result of how my grid is set up, then. I am still wondering why each tokenization takes so much longer in distributed mode, though: non-distributed, the tokenization takes about 2 hours for the dataset, but in distributed training each instance of tokenization takes 12-15 hours, and it does this once for each node. I tried letting one round of tokenization finish on the distributed training and then restarting the program to see if it would use the cached dataset. In distributed training, the first process loaded the cached processed dataset, but then the other nodes started doing their own tokenization again. How can I get the other processes to recognize that they should use that same cached data? For context, my grid uses SLURM to allocate resources, so I will usually end up getting different nodes every time I train. Thank you so much for your help.

Hi @sgugger, I’m a beginner, but I was wondering if this line transformers/examples/pytorch/language-modeling/ at main · huggingface/transformers · GitHub
should be something like if is_local_main_process():, because I guess main_process_first means the other processes would still enter this code block and redo the tokenization.
Please forgive me if the question is too dumb.

No, they will enter the context after the main process, and since everything Datasets does is cached, it will use the cache and not redo the tokenization.
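For anyone curious what that ordering looks like, here is a self-contained sketch of the main_process_first idea, using threads and a threading.Barrier in place of torch.distributed (the tokenize function and the in-memory cache are toy stand-ins, not the real Trainer or Datasets code): rank 0 runs the body alone, the barrier then releases the other ranks, and when they enter the body they find the result already cached.

```python
import threading
from contextlib import contextmanager


@contextmanager
def main_process_first(is_main: bool, barrier: threading.Barrier):
    # Non-main workers block here until the main worker has run the body.
    if not is_main:
        barrier.wait()
    try:
        yield
    finally:
        # The main worker releases everyone once its result is cached.
        if is_main:
            barrier.wait()


cache = {}  # stands in for the on-disk Arrow cache
log = []    # records which rank computed vs. reused the cache
lock = threading.Lock()


def tokenize(rank: int):
    with lock:
        if "tokenized" in cache:
            log.append((rank, "loaded from cache"))
        else:
            cache["tokenized"] = [101, 2023, 102]  # pretend tokenization
            log.append((rank, "tokenized"))


def worker(rank: int, barrier: threading.Barrier):
    with main_process_first(rank == 0, barrier):
        tokenize(rank)


world_size = 4
barrier = threading.Barrier(world_size)
threads = [threading.Thread(target=worker, args=(r, barrier))
           for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# log shows rank 0 tokenizing first and every other rank hitting the cache
```

This only holds when all ranks share the same cache, which is why the pattern breaks down across nodes that do not share a filesystem.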