Hello,
I am trying to use Accelerate on a SageMaker training instance (p3.8xlarge) with 4 GPUs, hoping that Accelerate can automatically leverage the multiple GPUs, but it seems it can't detect them.
So, while setting up the HuggingFace estimator for the SageMaker instance, I set the distribution parameter as
distribution = {'pytorchddp': {'enabled': True}}
and use it in the HF estimator:
pytorch_estimator = HuggingFace(
    entry_point='my_script.py',
    role=role,
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    py_version='py38',
    pytorch_version='1.10.2',
    transformers_version='4.17.0',
    volume_size=60,
    source_dir='/my_path/',
    tags=aws_tags,
    max_run=3 * 24 * 60 * 60,
    distribution=distribution,
    output_path='s3://my_s3/output',
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
)
Inside "my_script.py" I can verify whether PyTorch sees all the GPUs:
import torch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
print(available_gpus)
and indeed I can see there are 4 GPUs in the output:
[<torch.cuda.device object at 0x7f479f0b8820>, <torch.cuda.device object at 0x7f479f0b8760>, <torch.cuda.device object at 0x7f479f0b87c0>, <torch.cuda.device object at 0x7f479f0b8730>]
However, when I then set up Accelerate and have it print the number of processes it sees,
accelerator = Accelerator(mixed_precision=args.mixed_precision)
...
print("Accelerator has determined the num processes to be: ", accelerator.num_processes)
and it sees only 1:
“Accelerator has determined the num processes to be: 1”
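In case it helps anyone debugging the same thing: as far as I understand, Accelerate derives its process count from the standard torch distributed environment variables that the launcher is supposed to set, so a quick stdlib check inside my_script.py can tell whether the pytorchddp launcher actually spawned multiple worker processes. This is just a sketch; `distributed_env_summary` is a throwaway helper name, and the assumption is that SageMaker's pytorchddp launcher uses torchrun under the hood.

```python
import os

# Throwaway diagnostic for my_script.py (assumption: the pytorchddp launcher
# runs torchrun under the hood and therefore sets WORLD_SIZE, RANK and
# LOCAL_RANK in each worker process).
def distributed_env_summary(env=None):
    """Summarize the torch-distributed env vars; the defaults mean 'not launched'."""
    env = os.environ if env is None else env
    return {
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
    }

if __name__ == "__main__":
    # If this prints world_size 1 on a 4-GPU box, no launcher spawned workers,
    # which matches Accelerate reporting num_processes == 1.
    print(distributed_env_summary())
```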
This is further corroborated by the fact that the same training code takes as long on this p3.8xlarge instance as on a p3.2xlarge instance, i.e., only one GPU is doing the work (presumably because Accelerate only sees one GPU).
So my question is: is there some special way to make Accelerate aware of all the GPUs in a SageMaker instance? There was a similar discussion on GitHub (Cannot run distributed training on sagemaker · Issue #492 · huggingface/accelerate · GitHub) and I thought the issue had been resolved…
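One workaround I have been considering (untested, just a sketch) is to point entry_point at a small wrapper that re-launches the real training script with torch.distributed.run, one process per GPU, whenever no launcher has set WORLD_SIZE. The names launcher.py and build_launch_cmd are hypothetical; SM_NUM_GPUS is an environment variable SageMaker sets inside the training container.

```python
# Hypothetical wrapper (launcher.py) to use as entry_point instead of
# my_script.py. If no distributed launcher has set WORLD_SIZE, it re-launches
# the training script under torch.distributed.run so that the env vars
# Accelerate reads are populated for each worker.
import os
import subprocess
import sys

def build_launch_cmd(num_gpus, script="my_script.py", argv=()):
    """Return the torchrun-style command for `num_gpus` local processes."""
    return [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={num_gpus}", script, *argv,
    ]

if __name__ == "__main__":
    num_gpus = int(os.environ.get("SM_NUM_GPUS", 1))
    if "WORLD_SIZE" not in os.environ and num_gpus > 1:
        # Spawn one training process per GPU.
        subprocess.check_call(build_launch_cmd(num_gpus, argv=sys.argv[1:]))
    else:
        # Fall through to plain single-process training.
        subprocess.check_call([sys.executable, "my_script.py", *sys.argv[1:]])
```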
Thanks!