Setup for DeepSpeed Multi-GPU Training

According to the Trainer — transformers 4.4.2 documentation (see "Deployment in Notebooks"), the following code should work with multiple GPUs in a notebook:

DeepSpeed requires a distributed environment even when only one process is used.

This emulates a launcher in the notebook

import os
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

Now proceed as normal, plus pass the deepspeed config file

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(..., deepspeed="ds_config.json")
trainer = Trainer(...)
trainer.train()

However, I am struggling to get this running with 2 GPUs. There seems to be no way to manually tell DeepSpeed to use 2 GPUs. The documentation says DeepSpeed should detect them automatically, but on my system it does not: it only ever runs on 1 GPU. Depending on the RANK setting it runs on either GPU 0 or GPU 1, but never on both.
(I need to run this on 2 GPUs because I don't have an RTX 3090 with enough memory.)
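
For what it's worth, both cards are visible to PyTorch, so plain device detection does not seem to be the problem. A quick sanity check (a minimal sketch, assuming a standard PyTorch CUDA install) shows both devices:

import torch
print(torch.cuda.device_count())      # prints 2 here, so both GPUs are visible
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_name(1))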

Is there a way to manually tell DeepSpeed to use 2 GPUs in a Jupyter Notebook, as in the example above?

I have the same issue. It seems like the only solution is to use the deepspeed CLI launcher, or to emulate a multi-process distributed environment locally, which seems like more trouble than it's worth.
Any other guidance here would be appreciated!

If you want to use more than 1 GPU, DeepSpeed needs a multi-process environment. That is, you have to use the launcher for that purpose; it cannot be accomplished by emulating the distributed environment presented at the beginning of this section, which only ever creates a single process.
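
Concretely, the usual pattern is to move the notebook code into a plain script and start it with the deepspeed launcher, which spawns one process per GPU and sets up the distributed environment that the notebook cell above only emulates. A minimal sketch (the file name train_ds.py is a placeholder, and the ellipses stand for your usual model and data arguments):

# train_ds.py -- launch from a terminal, not from the notebook:
#   deepspeed --num_gpus=2 train_ds.py
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(..., deepspeed="ds_config.json")
trainer = Trainer(...)  # model, datasets, etc., as in the notebook cell
trainer.train()

Note that the MASTER_ADDR/RANK/LOCAL_RANK/WORLD_SIZE block is not needed in the script; the launcher provides those values to each process.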

Link - DeepSpeed Integration
