It’s actually very simple and straightforward to do this. The Trainer will automatically pick up the number of devices you want to use, and any of the examples in transformers/examples/pytorch can be run on multiple GPUs automatically; Hugging Face fully supports DDP there.
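The key point is that the training script itself needs no multi-GPU code. As an illustration, here’s a minimal sketch of what such a Trainer script boils down to (a condensed, hypothetical stand-in, not the actual run_glue.py):

```python
# Minimal sketch of a Trainer script; note there is nothing multi-GPU-specific
# in it. Trainer reads the distributed environment the launcher sets up.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Any small text-classification dataset works; MRPC mirrors the example below.
dataset = load_dataset("glue", "mrpc")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["sentence1"], batch["sentence2"], truncation=True),
    batched=True,
)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,  # per GPU: with 2 GPUs, effective batch of 4
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()  # DDP kicks in automatically when launched with torchrun
```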
In my example I’ll use the text classification one.
To check that it’s using both GPUs the whole time, I’ll run watch -n0.1 nvidia-smi in a separate terminal.
Now I’m assuming we’re running this through a clone of the repo, so the args will be set up similarly to how the tests are done:
```bash
torchrun --nproc_per_node 2 examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path distilbert-base-uncased \
  --output_dir outputs \
  --train_file ./tests/fixtures/tests_samples/MRPC/train.csv \
  --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv \
  --do_train --do_eval \
  --per_device_train_batch_size=2 \
  --per_device_eval_batch_size=1
```
And that’s all it takes: just launch it like normal via torchrun! You’ll see that both GPUs get utilized immediately.
If you want to avoid dealing with torchrun yourself, mix it in with our Accelerate library: run accelerate config once, and then all you have to do is launch the script via:
```bash
accelerate launch examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path distilbert-base-uncased \
  --output_dir outputs \
  --train_file ./tests/fixtures/tests_samples/MRPC/train.csv \
  --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv \
  --do_train --do_eval \
  --per_device_train_batch_size=2 \
  --per_device_eval_batch_size=1
```
The config you set will wrap around all the complicated torchrun bits, so you don’t need to handle any of that yourself. Even if you don’t use Accelerate for any actual training code, you can still use its launcher this way.
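For reference, accelerate config writes its answers to a small YAML file (by default under ~/.cache/huggingface/accelerate/). The exact fields vary by Accelerate version, but for two GPUs on one machine it looks roughly like this sketch:

```yaml
# Sketch of a generated default_config.yaml for 2 GPUs on one machine;
# exact fields depend on your Accelerate version.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
```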
Not entirely sure what Cherry is, but Accelerate is a lower-level library capable of doing DDP across single-node and multi-node GPUs and TPUs, so any and all forms of DDP. But again, the Trainer handles all of that magic for you in that regard; you just need to launch it the right way.
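If you ever do want that lower level, the core pattern is small. Here’s a minimal sketch of a raw Accelerate training loop (toy model and data, purely illustrative):

```python
# Minimal sketch of a raw Accelerate loop (toy model/data, illustrative only).
# The same script runs on 1 GPU, multi-GPU DDP, or TPU depending on the launcher.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)  # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(
    torch.randn(64, 10), torch.randint(0, 2, (64,))
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

# prepare() moves everything to the right device(s) and wraps the model
# in DistributedDataParallel when launched across multiple processes
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```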