LM example run_clm.py isn't distributing data across multiple GPUs as expected

Does this thread answer your question: Using Transformers with DistributedDataParallel — any examples?
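
In case it helps while you look at that thread: as far as I know, `run_clm.py` uses the `Trainer`, which only runs in DDP mode when the script is started through a distributed launcher (e.g. `torchrun --nproc_per_node=2 run_clm.py ...`); launched with plain `python` on a multi-GPU box it falls back to `nn.DataParallel`, which can look like the data isn't being distributed. Below is a minimal, self-contained sketch (not the actual `run_clm.py` internals, and the dummy dataset and model are placeholders) of how DDP shards data per process via `DistributedSampler`:

```python
# Minimal DDP data-sharding sketch; launch with:
#   torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy dataset; in run_clm.py this would be the tokenized corpus.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))

    # DistributedSampler gives each process a disjoint shard of the data;
    # without it, every GPU would iterate over the full dataset each epoch.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    # Placeholder model; DDP all-reduces gradients across processes.
    model = DDP(torch.nn.Linear(16, 1).to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # gradients synced here by DDP
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The practical takeaway is the launcher: if you start `run_clm.py` the same way (`torchrun --nproc_per_node=N run_clm.py ...`), the `Trainer` should pick up the distributed environment and shard the data across GPUs for you.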