According to the Trainer documentation for transformers 4.4.2 (see "Deployment in Notebooks"), the following code should work with multiple GPUs in a notebook:
# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook.
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if "RuntimeError: Address already in use"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Now proceed as normal, plus pass the DeepSpeed config file.
training_args = TrainingArguments(..., deepspeed="ds_config.json")
trainer = Trainer(...)
trainer.train()
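As far as I can tell, the environment variables above describe a distributed world of exactly one process (WORLD_SIZE=1). A quick sanity check from the notebook, assuming the process group has already been initialized by Trainer/DeepSpeed:

import torch.distributed as dist

# Sanity check once trainer.train() has initialized the process group:
if dist.is_initialized():
    print("world size:", dist.get_world_size())  # 1 with the settings above
    print("current rank:", dist.get_rank())      # 0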
However, I am struggling to get this running with 2 GPUs. There seems to be no way to manually tell DeepSpeed to use 2 GPUs. The documentation says DeepSpeed should detect them automatically, but on my system it does not: it only ever runs on 1 GPU. Depending on the RANK setting it runs on either GPU 0 or GPU 1, but never on both.
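Both GPUs are visible to plain PyTorch from the same notebook, so the cards themselves are detected; only DeepSpeed ends up using one of them. A quick check (nothing DeepSpeed-specific here):

import torch

# Both GPUs show up, so this does not look like a CUDA visibility problem:
print(torch.cuda.device_count())  # 2 on my machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))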
(I need to run this on 2 GPUs because I don't have an RTX 3090 with enough memory for a single-GPU run.)
Is there a way to manually tell DeepSpeed to use 2 GPUs in a Jupyter notebook, as in the example above?
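For comparison, outside of a notebook the standard deepspeed launcher accepts an explicit GPU count (deepspeed --num_gpus=2 ...); what I am missing is the notebook equivalent. The only workaround I can think of is shelling out to that launcher from a cell, with the training code moved into a script. A rough, untested sketch (train.py and its flags are hypothetical placeholders):

import subprocess

# Untested idea: invoke the regular DeepSpeed launcher from a notebook cell.
# This requires moving the Trainer setup into a standalone script (train.py
# here is a placeholder), which defeats much of the point of using a notebook.
subprocess.run(
    ["deepspeed", "--num_gpus=2", "train.py", "--deepspeed", "ds_config.json"],
    check=True,
)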