Training using multiple GPUs

@sgugger The model is the Routing Transformer language model (RoutingTransformerLM). The source code is here: