Torchrun uses more VRAM than running the script with Python directly

Hi there

I created a small training script using the Hugging Face Transformers Trainer class to finetune a Mistral 7B model.
Right now I am running some tests to compare VRAM usage across different configurations, and there is something I do not understand. I am not sure whether this behaviour is normal; most probably I am doing something wrong.
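
For context, the script looks roughly like this (a simplified sketch; the model name, dataset, and TrainingArguments values here are just placeholders, the real version is in the repo linked at the bottom):

```python
# test.py -- simplified sketch, not the exact script from the repo
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset: one text file, tokenized for causal LM training
dataset = load_dataset("text", data_files="train.txt")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```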

  • If I run the script the normal Python way, I see about 19 GB of VRAM usage (python test.py)
  • If I run it with torchrun, it is around 25 GB (torchrun --nproc_per_node 1 test.py); see the measurement sketch right after this list
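
Besides watching nvidia-smi, something like this (a small addition, not necessarily in the repo) is how I would read the peak allocation from inside the script:

```python
import torch

# Run after trainer.train(); reports the peak CUDA allocation on the
# current device. nvidia-smi usually shows a higher number because it
# also includes the CUDA context and PyTorch's cached-but-unused memory.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```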

Both times only one GPU is being used, which I checked with nvidia-smi. If I run torchrun with 2 GPUs (--nproc_per_node 2), both consume around 25 GB, which I think is normal. But why is the VRAM usage so much higher with torchrun even when only one GPU is used?
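
Could it be related to torchrun setting the distributed environment variables even for a single process? A quick check like this (just a diagnostic sketch, not part of the repo) should show whether the script ends up in distributed mode:

```python
import os

import torch.distributed as dist

# torchrun exports these even with --nproc_per_node 1; a plain
# `python test.py` run leaves them unset.
print({k: os.environ.get(k) for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE")})

# If the Trainer picked up the torchrun environment, the default
# process group will be initialized by the time training starts.
print("distributed initialized:", dist.is_available() and dist.is_initialized())
```

If the process group is initialized, I assume the model gets wrapped in DistributedDataParallel, which would allocate extra gradient buffers, but I am not sure whether that alone explains a 6 GB difference.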

The code is in this GitHub repo. Thanks a lot in advance to everyone who takes the time to help me here :slight_smile: