Setup for Deepspeed Multi GPU Training

maxmaier · July 10, 2022, 8:34pm

According to the Trainer documentation for transformers 4.4.2 (see "Deployment in Notebooks"), the following code should work in a notebook with multiple GPUs:

# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994'  # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

# Now proceed as normal, plus pass the deepspeed config file
training_args = TrainingArguments(..., deepspeed="ds_config.json")
trainer = Trainer(...)
trainer.train()

However, I am struggling to get this running with 2 GPUs. There seems to be no way to manually tell DeepSpeed to use 2 GPUs. The documentation says DeepSpeed should detect them automatically, but it does not on my system: it only runs on 1 GPU. Depending on the RANK setting, it runs on either GPU 0 or GPU 1, but never on both.
(I need to run this on 2 GPUs because I don’t have an RTX3090 with enough memory)

Is there a way to manually tell DeepSpeed to use 2 GPUs in a Jupyter Notebook, as in the example above?

Charm3link · October 26, 2022, 8:18pm

I have the same issue. It seems like the only solution is to use the DeepSpeed CLI launcher or to emulate a distributed environment locally, which seems like more trouble than it's worth.
Any other guidance here would be appreciated!

Indramal · December 7, 2022, 8:49am

If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented at the beginning of this section.
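To illustrate why the single-process emulation cannot cover two GPUs, here is a minimal sketch of the environment a launcher would give each process. The helper name per_process_env is hypothetical, and the sketch assumes a single node, so LOCAL_RANK equals RANK. The point is that each GPU needs its own process with its own RANK and LOCAL_RANK, which one notebook kernel cannot provide.

```python
def per_process_env(world_size, master_addr="localhost", master_port=9994):
    """Return one environment dict per process (hypothetical helper).

    Assumes a single node, so the local rank equals the global rank.
    """
    envs = []
    for rank in range(world_size):
        envs.append({
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "RANK": str(rank),             # global rank of this process
            "LOCAL_RANK": str(rank),       # rank on this node (single node assumed)
            "WORLD_SIZE": str(world_size), # total number of processes
        })
    return envs


# Two GPUs means two processes, each with a different RANK / LOCAL_RANK.
envs = per_process_env(2)
for env in envs:
    print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

The notebook snippet from the first post sets only one such environment (WORLD_SIZE of 1), which matches the observation that training runs on a single GPU.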

Link - DeepSpeed Integration
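For reference, a rough sketch of what using the launcher could look like when driven from Python. It assumes the training code has been moved out of the notebook into a script, here called train.py (a hypothetical name), that DeepSpeed is installed so the deepspeed executable is on PATH, and that the script accepts a --deepspeed argument (for example by parsing TrainingArguments from the command line). The helper names are illustrative only.

```python
import shlex
import subprocess


def build_launch_command(script, num_gpus, script_args=()):
    """Build a `deepspeed` launcher command for a single node (hypothetical helper)."""
    # `--num_gpus=N` asks the launcher to start N processes on this node.
    return ["deepspeed", f"--num_gpus={num_gpus}", script, *script_args]


def launch(script, num_gpus, script_args=()):
    """Run the launcher; needs DeepSpeed installed and `deepspeed` on PATH."""
    return subprocess.run(
        build_launch_command(script, num_gpus, script_args), check=True
    )


# train.py and ds_config.json are placeholder names for this sketch.
cmd = build_launch_command("train.py", 2, ["--deepspeed", "ds_config.json"])
print(shlex.join(cmd))
```

Calling launch("train.py", 2, ["--deepspeed", "ds_config.json"]) from a notebook cell would then start the two processes outside the kernel, which is the multi-process setup described above.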
