Hello, when using the example script run_clm.py in examples/pytorch/language-modeling/ with distributed training it seems to repeat the tokenization for each GPU. The messages “Running tokenizer on dataset” and “Grouping texts in chunks of {block_size}” repeat over and over. This does not happen whe…

Clm repeats tokenization when distributed

sgugger July 15, 2022, 2:50pm 6

No, the other processes will enter the context only after the main process has finished, and since everything Datasets does is cached, they will load the result from the cache and not redo the tokenization.
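To illustrate the mechanism, here is a minimal standard-library sketch, not the actual code of `run_clm.py` or Datasets. It assumes the "context" above is a main-process-first context manager (presumably `main_process_first` on `TrainingArguments`, which wraps the `map` calls in the example script). Threads stand in for GPU ranks, and a JSON file stands in for the Datasets cache. The names `cached_tokenize`, `main_process_first` and `simulate` are hypothetical helpers written for this illustration.

```python
import contextlib
import json
import tempfile
import threading
from pathlib import Path


def cached_tokenize(texts, cache_dir, compute_calls):
    """Toy stand-in for a cached `map`: compute once, then load from disk."""
    cache_file = Path(cache_dir) / "tokenized.json"
    if cache_file.exists():
        # Cache hit: no tokenization work is done.
        return json.loads(cache_file.read_text())
    compute_calls.append(1)  # record that real work happened
    tokens = [text.split() for text in texts]  # whitespace "tokenizer" (assumption)
    cache_file.write_text(json.dumps(tokens))
    return tokens


@contextlib.contextmanager
def main_process_first(rank, main_done):
    """Hypothetical helper: rank 0 runs the body first, other ranks wait for it."""
    if rank != 0:
        main_done.wait()  # block until the main process has finished
    try:
        yield
    finally:
        if rank == 0:
            main_done.set()  # release the waiting ranks


def simulate(world_size=4):
    texts = ["hello world", "running tokenizer on dataset"]
    compute_calls = []
    results = {}
    main_done = threading.Event()

    with tempfile.TemporaryDirectory() as cache_dir:

        def worker(rank):
            with main_process_first(rank, main_done):
                results[rank] = cached_tokenize(texts, cache_dir, compute_calls)

        # Threads simulate the ranks of a distributed run (assumption).
        threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

    return results, len(compute_calls)


results, n_computed = simulate()
```

In this sketch every rank calls the tokenization step, but only rank 0 actually computes; the others find the cache file and load it. In a real run the waiting would be done with a distributed barrier rather than a `threading.Event`.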
