Distributed Training run_summarization.py

For me, it worked on an ml.p3.16xlarge instance with a batch_size of 2.
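
In case it helps, here is a rough sketch of what that setup works out to in terms of global batch size. It assumes the batch_size of 2 is the per-GPU value (`per_device_train_batch_size` in the Transformers training arguments), a single instance, and no gradient accumulation; none of that is stated above, so adjust as needed. The `job` dict and the `effective_batch_size` helper are hypothetical names used only for illustration. An ml.p3.16xlarge has 8 V100 GPUs.

```python
# Sketch only: the names and job layout below are illustrative assumptions.

# An ml.p3.16xlarge instance has 8 NVIDIA V100 GPUs.
GPUS_PER_INSTANCE = {"ml.p3.16xlarge": 8}


def effective_batch_size(instance_type, instance_count,
                         per_device_batch_size, grad_accum_steps=1):
    """Return the global batch size for data-parallel training."""
    total_gpus = GPUS_PER_INSTANCE[instance_type] * instance_count
    return total_gpus * per_device_batch_size * grad_accum_steps


# Hypothetical job settings; assumes batch_size of 2 means 2 samples per GPU
# and that the job runs on a single instance.
job = {
    "instance_type": "ml.p3.16xlarge",
    "instance_count": 1,
    "hyperparameters": {
        "per_device_train_batch_size": 2,
    },
}

global_batch = effective_batch_size(
    job["instance_type"],
    job["instance_count"],
    job["hyperparameters"]["per_device_train_batch_size"],
)
print(global_batch)
```

Under those assumptions each optimizer step sees 8 x 2 = 16 samples. If you hit out-of-memory errors on a smaller instance, keeping the per-GPU batch small and raising gradient accumulation is one way to hold the global batch size steady.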