Trainer API for Model Parallelism using AutoModelForQuestionAnswering

Hello,
I am having problems using the Trainer API with multiple GPUs for model parallelism using an AutoModelForQuestionAnswering. Specifically, I want to train a model (llama3-8b) that is too large to fit on a single GPU using multiple GPUs.

My understanding is that the Trainer API should automatically detect multiple GPUs and distribute the model accordingly. I believe this is at least partly working, because I can see memory being used on all 4 GPUs on my system (watching nvidia-smi -l 1). Furthermore, I receive the error:

“RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)”

If I am understanding this correctly, the model is distributed among devices 0-3 (or at least devices 0 and 3), but the labels end up on a different device than the output that the loss is computed against (device 3). Is there a way to ensure the labels are sent to the correct device? Since I am using the Trainer class, I’m not calling to(device) myself, but I was hoping for something similar.
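Something like the following is roughly what I had in mind (just a sketch, untested; it assumes the model was loaded with device_map='auto' so it exposes hf_device_map, and that the QA head is named qa_outputs; DeviceFixTrainer is a made-up name):

```python
# Untested sketch: move the start/end position labels to the device of the QA
# head before the model computes its loss. Assumes the model has hf_device_map
# (it does when loaded with device_map="auto") and that the head is "qa_outputs".
from transformers import Trainer

class DeviceFixTrainer(Trainer):  # hypothetical subclass name
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        device_map = getattr(model, "hf_device_map", {}) or {}
        head_device = device_map.get("qa_outputs")  # device of the final QA head, if present
        if head_device is not None:
            for key in ("start_positions", "end_positions"):
                if key in inputs:
                    inputs[key] = inputs[key].to(head_device)
        return super().compute_loss(model, inputs, return_outputs=return_outputs, **kwargs)
```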

Other things to note:
The code works for smaller models on a single GPU.
The code works for prediction.
When checking trainer.args.parallel_mode, it is ParallelMode.NOT_PARALLEL, which seems incorrect.
The training arguments are trainer_args = TrainingArguments(per_device_train_batch_size=4, num_train_epochs=25, output_dir='output', logging_steps=1) (full setup sketched below).
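For reference, this is roughly how everything is set up (the checkpoint name and train_dataset are stand-ins, not my exact script):

```python
# Rough reconstruction of the setup; dataset preparation is omitted.
from transformers import AutoModelForQuestionAnswering, Trainer, TrainingArguments

model_name = "meta-llama/Meta-Llama-3-8B"  # stand-in for the llama3-8b checkpoint
model = AutoModelForQuestionAnswering.from_pretrained(model_name, device_map="auto")

trainer_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,
    num_train_epochs=25,
    logging_steps=1,
)

trainer = Trainer(model=model, args=trainer_args, train_dataset=train_dataset)  # train_dataset prepared elsewhere
print(trainer.args.parallel_mode)  # reports ParallelMode.NOT_PARALLEL here
trainer.train()                    # fails with the device mismatch above
```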

I halfway figured this out. Running the script with torchrun rather than python3 handles device placement for you, e.g.:

torchrun --nproc_per_node=4 script.py

Since torchrun handles device placement, no device_map should be specified in the code when creating the model (i.e. don't pass device_map='auto' to from_pretrained).
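In other words, the torchrun variant looks roughly like this (again a sketch; the checkpoint name and train_dataset are stand-ins), launched with torchrun --nproc_per_node=4 script.py:

```python
# script.py (sketch): no device_map here. torchrun sets LOCAL_RANK/WORLD_SIZE,
# and the Trainer then wraps the model in DistributedDataParallel per process.
from transformers import AutoModelForQuestionAnswering, Trainer, TrainingArguments

model = AutoModelForQuestionAnswering.from_pretrained("meta-llama/Meta-Llama-3-8B")

trainer_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,
    num_train_epochs=25,
    logging_steps=1,
)

trainer = Trainer(model=model, args=trainer_args, train_dataset=train_dataset)  # train_dataset prepared elsewhere
trainer.train()
```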

This works for LoRA training, but I get OOM errors without LoRA, which makes sense since this is data parallelism and each process keeps a full copy of the model on its own GPU. I think the model should fit if it were actually distributed across all 4 GPUs. My current plan is to skip the Trainer class and specify the training routine and device placement manually.
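If I go that route, I am imagining something along these lines (completely untested sketch; names are placeholders, and I am assuming the hooks added by device_map='auto' move activations between GPUs during the forward pass):

```python
# Untested sketch of a manual loop with the model sharded across the 4 GPUs.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # stand-in checkpoint name
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
first_device = next(iter(model.hf_device_map.values()))             # device of the first shard
head_device = model.hf_device_map.get("qa_outputs", first_device)   # where the loss is computed (assumption)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # assumes train_dataset yields tensors

model.train()
for epoch in range(25):
    for batch in train_loader:
        # Labels go to the shard that computes the loss; everything else to the first shard.
        labels = {k: batch.pop(k).to(head_device) for k in ("start_positions", "end_positions")}
        batch = {k: v.to(first_device) for k, v in batch.items()}
        outputs = model(**batch, **labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```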