Does CLM repeat tokenization when distributed?

No. The other processes enter the context only after the main process has finished, and since everything Datasets does is cached, they will load the cached result and not redo the tokenization.
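To make the ordering concrete, here is a minimal standard-library sketch of the "main process first, then everyone else reads the cache" pattern. It is a toy simulation and not the actual Transformers or Datasets code: threads stand in for processes, a JSON file stands in for the Datasets cache, and the names `main_process_first`, `cached_tokenize`, and `run_simulation` are hypothetical helpers written for this illustration. The context referred to above is presumably the `main_process_first` context manager used in the example script, but that is an assumption on my part.

```python
import json
import tempfile
import threading
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def main_process_first(is_main, main_done):
    """Toy barrier (hypothetical): replicas wait until the main worker leaves the block."""
    if not is_main:
        main_done.wait()  # replicas block here until the main worker is done
    try:
        yield
    finally:
        if is_main:
            main_done.set()  # release the waiting replicas


def cached_tokenize(texts, cache_file, stats, lock):
    """Whitespace 'tokenizer' that reuses an on-disk cache when one exists."""
    if cache_file.exists():
        with lock:
            stats["cache_hits"] += 1
        return json.loads(cache_file.read_text())
    with lock:
        stats["tokenized"] += 1
    tokens = [text.split() for text in texts]
    cache_file.write_text(json.dumps(tokens))
    return tokens


def run_simulation(world_size=4):
    """Run one 'main' worker (rank 0) and world_size - 1 replicas."""
    texts = ["hello world", "causal language modeling"]
    stats = {"tokenized": 0, "cache_hits": 0}
    results = {}
    lock = threading.Lock()
    main_done = threading.Event()

    with tempfile.TemporaryDirectory() as tmp:
        cache_file = Path(tmp) / "tokenized.json"

        def worker(rank):
            with main_process_first(rank == 0, main_done):
                results[rank] = cached_tokenize(texts, cache_file, stats, lock)

        # Start the replicas before rank 0 to show that they really do wait.
        threads = [
            threading.Thread(target=worker, args=(rank,))
            for rank in reversed(range(world_size))
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

    return stats, results
```

Calling `run_simulation(4)` tokenizes once on rank 0 and gives three cache hits on the other ranks, which is the behaviour described above: the work is done a single time and every later process only loads the result.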