Understand running a script with `python` and `accelerate`

Hi,

As a beginner, I have a quick question about the behavior of running a script with the `python` command vs. `accelerate launch`. For instance, take this notebook as an example: Google Colab. Say I convert it into a Python script called qwen.py and attempt to run it on a single node with 8 GPUs. I found that the following commands:

```
python qwen.py
accelerate launch --multi_gpu --num_processes 8 qwen.py
```

are different, but related. Specifically, with the `python qwen.py` command, I still see all 8 GPUs being used when inspecting `nvidia-smi`. Also, when I compare it with `CUDA_VISIBLE_DEVICES=0 python qwen.py`, the runtime is reduced by exactly 8x. When running `python qwen.py`, I notice `trainer.accelerator.num_processes=1` and `trainer.accelerator.state` shows: ‘Distributed environment: NO’.

On the other hand, with the `accelerate launch` command, I get `trainer.accelerator.num_processes=8` and `trainer.accelerator.state` shows: ‘Distributed environment: MULTI_GPU, backend NCCL’. All GPUs are also occupied when inspecting with `nvidia-smi`, but the occupancy pattern (e.g., per-GPU memory usage) is different from the `python` run.
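For context, here is a minimal sketch of how I check these values. It builds an `Accelerator` directly instead of going through the `trainer` object from the notebook, but the attributes are the same ones I am reading off the trainer (the extra `process_index` print is just to see which rank each process gets):

```python
from accelerate import Accelerator

# Minimal check of the distributed state, independent of the Trainer.
# Under plain `python qwen.py` this prints num_processes=1 and
# "Distributed environment: NO"; under
# `accelerate launch --multi_gpu --num_processes 8 qwen.py`
# each of the 8 processes prints num_processes=8 and a MULTI_GPU state.
accelerator = Accelerator()
print("num_processes:", accelerator.num_processes)
print("process_index:", accelerator.process_index)
print(accelerator.state)
```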

I just want to better understand what is being done under the hood, and what the difference is between running with `python` and `accelerate launch`. Moreover, it seems to me that `python qwen.py` by itself should just run on `cuda:0`, so why does it run on all GPUs?
