Accelerate sees only one GPU on a multi-GPU SageMaker instance

Hello,
I am trying to use Accelerate on a SageMaker training instance (p3.8xlarge) with 4 GPUs, hoping that Accelerate can automatically leverage the multiple GPUs, but it seems it can't detect them.
While setting up the HuggingFace estimator for the SageMaker instance, I set the distribution parameter as

distribution = {
    'pytorchddp': {'enabled': True}
}
and use it in the HF estimator:

from sagemaker.huggingface import HuggingFace

pytorch_estimator = HuggingFace(entry_point='my_script.py',
                                role=role,
                                instance_count=1,
                                instance_type='ml.p3.8xlarge',
                                py_version='py38',
                                pytorch_version='1.10.2',
                                transformers_version='4.17.0',
                                volume_size=60,
                                source_dir='/my_path/',
                                tags=aws_tags,
                                max_run=3 * 24 * 60 * 60,
                                distribution=distribution,
                                output_path='s3://my_s3/output',
                                checkpoint_s3_uri=checkpoint_s3_bucket,
                                checkpoint_local_path=checkpoint_local_path
                                )

Inside “my_script.py” I can verify that PyTorch can see all the GPUs:

import torch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
print(available_gpus)

and indeed the output shows there are 4 GPUs:
[<torch.cuda.device object at 0x7f479f0b8820>, <torch.cuda.device object at 0x7f479f0b8760>, <torch.cuda.device object at 0x7f479f0b87c0>, <torch.cuda.device object at 0x7f479f0b8730>]

However, when I then set up Accelerate and have it print the number of processes it sees,

from accelerate import Accelerator

accelerator = Accelerator(mixed_precision=args.mixed_precision)
...
print("Accelerator has determined the num processes to be: ", accelerator.num_processes)

and it sees only 1:
“Accelerator has determined the num processes to be: 1”

This is further corroborated by the fact that running the same training code on a p3.2xlarge instance takes as much time as on this p3.8xlarge instance, i.e., only one GPU is working (presumably because Accelerate only sees one GPU).
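My understanding is that Accelerate derives its process count from the distributed environment variables set by the launcher (e.g. WORLD_SIZE, RANK, LOCAL_RANK), so the following diagnostic snippet (hypothetical, not part of my original script) should show whether SageMaker actually started one process per GPU:

import os

# These variables are normally set by the distributed launcher (torchrun / SMDDP).
# If WORLD_SIZE is missing or "1", the script was started as a single process,
# which would explain why Accelerate reports num_processes == 1.
for var in ("WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(var, "=", os.environ.get(var, "<not set>"))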
So my question is: is there some special way to make Accelerate aware of all the GPUs in a SageMaker instance? There was a similar discussion on GitHub (Cannot run distributed training on sagemaker · Issue #492 · huggingface/accelerate · GitHub) and I thought the issue was resolved…
Thanks!

Hello @alexcarterkarsus,

Are you following these examples: notebooks/sagemaker/22_accelerate_sagemaker_examples at main · huggingface/notebooks · GitHub?

You might be launching or using the integration in the wrong way. Please refer to the above repo for concrete examples: try running them first, then adapt them to your use case. A rough sketch of the pattern they follow is below.
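For illustration only, here is a minimal sketch of how the linked examples wire the estimator and the training script together (the script name, source directory, and version pins below are placeholders reusing values from your question; the notebooks may pin different, newer container versions, which can matter for the pytorchddp integration). With pytorchddp enabled, SageMaker launches one process per GPU and Accelerator() picks up the distributed environment, so accelerator.num_processes should report 4 on a p3.8xlarge:

# launcher side
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='train.py',        # an Accelerate training script
    source_dir='./scripts',
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    distribution={'pytorchddp': {'enabled': True}},  # one process per GPU
)
huggingface_estimator.fit()

# inside train.py
from accelerate import Accelerator

accelerator = Accelerator()
# With pytorchddp on a 4-GPU instance this should print 4.
print('num_processes:', accelerator.num_processes)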