Distributed Training on SageMaker

Phillip the superhero. I was able to make it work, thank you. I increased the volume size and reduced both per_device_train_batch_size and per_device_eval_batch_size to 2.
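For anyone landing here later, a minimal sketch of where those two changes go, assuming the job is launched with the SageMaker Python SDK's HuggingFace estimator. Only the two batch sizes come from this post. The volume size, instance type, script names and framework versions below are placeholders, so adjust them to your own setup.

```python
# Hyperparameters forwarded to the training script.
# The two batch sizes are the values mentioned in the post.
hyperparameters = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
}

# Keyword arguments for sagemaker.huggingface.HuggingFace.
# Everything marked "placeholder" is an assumption, not a value from the post.
estimator_kwargs = {
    "entry_point": "train.py",          # placeholder script name
    "source_dir": "./scripts",          # placeholder; must exist locally
    "instance_type": "ml.p3.16xlarge",  # placeholder instance type
    "instance_count": 1,
    "volume_size": 200,                 # placeholder; EBS volume size in GB
    "transformers_version": "4.6",      # placeholder version combination
    "pytorch_version": "1.7",
    "py_version": "py36",
    "hyperparameters": hyperparameters,
}


def build_estimator(role, **overrides):
    """Create the estimator; needs the sagemaker package and AWS credentials."""
    from sagemaker.huggingface import HuggingFace

    return HuggingFace(role=role, **{**estimator_kwargs, **overrides})
```

Calling build_estimator(role).fit(...) with your own IAM role and data channels would then start the job. The import sits inside the function so the settings can be inspected without the SDK installed.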

Thanks. I am now having trouble training a causal LM (text generation), so I opened an issue in the repo: "ValueError: Source directory does not exist in the repo. Training causal lm in sagemaker". I don't think you have published a demo or notebook for this task yet. I reviewed run_clm.py and it looks fine.

Thank you Phillip.
