Hi,
As a beginner, I have a quick question about the difference in behavior between running a script with `python` versus with `accelerate launch`. For instance, take this notebook as an example: Google Colab. Say I convert it into a Python script called `qwen.py` and run it on a single node with 8 GPUs. I found that the following two commands:
python qwen.py
accelerate launch --multi_gpu --num_processes 8 qwen.py
are different, but related. Specifically, with `python qwen.py` I still see all 8 GPUs being used when inspecting `nvidia-smi`. Also, compared with `CUDA_VISIBLE_DEVICES=0 python qwen.py`, the runtime is reduced by exactly 8x. When running `python qwen.py`, I notice `trainer.accelerator.num_processes = 1` and `trainer.accelerator.state` reports "Distributed environment: NO".
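For context, this is roughly how I read those numbers (a minimal sketch using a standalone `Accelerator`, which I assume reflects the same state the trainer's accelerator sees):

```python
import torch
from accelerate import Accelerator

# Reproduces the numbers quoted above. Under plain `python qwen.py` this is a
# single process with no distributed setup, even though all 8 GPUs are visible
# to PyTorch; under `accelerate launch --multi_gpu --num_processes 8` each of
# the 8 processes prints its own view.
accelerator = Accelerator()
print("num_processes :", accelerator.num_processes)   # 1 vs. 8
print("process_index :", accelerator.process_index)   # 0 vs. 0..7
print("device        :", accelerator.device)          # cuda vs. cuda:<local rank>
print("CUDA devices  :", torch.cuda.device_count())   # 8 in both cases for me
print(accelerator.state)                               # "Distributed environment: NO" vs. "MULTI_GPU"
```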
On the other hand, with the `accelerate launch` command, I get `trainer.accelerator.num_processes = 8` and `trainer.accelerator.state` reports "Distributed environment: MULTI_GPU, Backend: nccl". All GPUs are occupied when inspecting with `nvidia-smi`; however, the occupancy pattern (e.g., memory per GPU) looks different from the `python` run.
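To make the comparison concrete, this is what I print to see the distributed environment each launcher produces (a sketch; I am assuming `accelerate launch` relies on the standard `torch.distributed` environment variables, since the state above mentions the NCCL backend):

```python
import os
import torch.distributed as dist

# Run this after the Trainer / Accelerator has been created.
# With `python qwen.py` the variables below are unset and no process group
# exists; with `accelerate launch --multi_gpu --num_processes 8` each of the
# 8 worker processes reports its own RANK / LOCAL_RANK.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var}={os.environ.get(var)}")

print("process group initialized:", dist.is_initialized())
if dist.is_initialized():
    print("backend   :", dist.get_backend())      # "nccl" in my runs
    print("world size:", dist.get_world_size())   # 8
```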
I just want to better understand what is being done under the hood, and what the difference is between running with `python` and with `accelerate launch`. Moreover, it seems to me that `python qwen.py` by itself should just run on `cuda:0`, so why does it run on all GPUs?
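In case it matters for the answer: my current guess (not verified against the notebook) is that the model is loaded with something like `device_map="auto"`, which as far as I understand lets a single process shard the weights across every visible GPU. The model id below is just a placeholder:

```python
from transformers import AutoModelForCausalLM

# Assumption: the notebook loads the model roughly like this. With
# device_map="auto", one `python qwen.py` process can place layers on all
# 8 GPUs, which would explain the nvidia-smi pattern I describe above.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # placeholder model id
    device_map="auto",
)

# What I expected instead: everything pinned to a single device, which is
# effectively what CUDA_VISIBLE_DEVICES=0 forces.
model_on_one_gpu = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # placeholder model id
).to("cuda:0")
```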