Trainer API for Model Parallelism using AutoModelForQuestionAnswering

Hello,
I am having problems using the Trainer API with multiple GPUs for model parallelism using an AutoModelForQuestionAnswering. Specifically, I want to train a model (llama3-8b) that is too large to fit on a single GPU using multiple GPUs.

My understanding is that the Trainer API should automatically detect multiple GPUs and distribute the model accordingly. I believe this is at least partly working, because I can see memory being used on all 4 GPUs on my system (watching nvidia-smi -l 1). Furthermore, I receive the error:

“RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)”

If I am understanding this correctly, the model is distributed among devices 0-3 (or at least devices 0 and 3), but the labels end up on a different device than the output that the loss is computed against (device 3). Is there a way to ensure the labels are sent to the correct device? Since I am using the Trainer class, I’m not calling to(device) myself, but I was hoping for something similar.
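Something like the following is roughly what I had in mind (just a sketch, untested; it assumes the model was loaded with device_map='auto' so it exposes hf_device_map, and that the QA head is named qa_outputs; DeviceFixTrainer is a made-up name):

```python
# Untested sketch: move the start/end position labels to the device of the QA
# head before the model computes its loss. Assumes the model has hf_device_map
# (it does when loaded with device_map="auto") and that the head is "qa_outputs".
from transformers import Trainer

class DeviceFixTrainer(Trainer):  # hypothetical subclass name
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        device_map = getattr(model, "hf_device_map", {}) or {}
        head_device = device_map.get("qa_outputs")  # device of the final QA head, if present
        if head_device is not None:
            for key in ("start_positions", "end_positions"):
                if key in inputs:
                    inputs[key] = inputs[key].to(head_device)
        return super().compute_loss(model, inputs, return_outputs=return_outputs, **kwargs)
```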

Other things to note:
The code works for smaller models on a single GPU.
The code works for prediction.
When checking trainer.args.parallel_mode, it is ParallelMode.NOT_PARALLEL, which seems incorrect.
The training arguments are trainer_args = TrainingArguments(per_device_train_batch_size=4, num_train_epochs=25, output_dir='output', logging_steps=1) (full setup sketched below).
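For reference, this is roughly how everything is set up (the checkpoint name and train_dataset are stand-ins, not my exact script):

```python
# Rough reconstruction of the setup; dataset preparation is omitted.
from transformers import AutoModelForQuestionAnswering, Trainer, TrainingArguments

model_name = "meta-llama/Meta-Llama-3-8B"  # stand-in for the llama3-8b checkpoint
model = AutoModelForQuestionAnswering.from_pretrained(model_name, device_map="auto")

trainer_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,
    num_train_epochs=25,
    logging_steps=1,
)

trainer = Trainer(model=model, args=trainer_args, train_dataset=train_dataset)  # train_dataset prepared elsewhere
print(trainer.args.parallel_mode)  # reports ParallelMode.NOT_PARALLEL here
trainer.train()                    # fails with the device mismatch above
```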

I halfway figured this out. Running the script with torchrun rather than python3 handles device placement for you, e.g.:

torchrun --nproc_per_node=4 script.py

Since torchrun handles device placement, no device_map should be specified in the code when creating the model (i.e. don't pass device_map='auto' to from_pretrained).
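In other words, the torchrun variant looks roughly like this (again a sketch; the checkpoint name and train_dataset are stand-ins), launched with torchrun --nproc_per_node=4 script.py:

```python
# script.py (sketch): no device_map here. torchrun sets LOCAL_RANK/WORLD_SIZE,
# and the Trainer then wraps the model in DistributedDataParallel per process.
from transformers import AutoModelForQuestionAnswering, Trainer, TrainingArguments

model = AutoModelForQuestionAnswering.from_pretrained("meta-llama/Meta-Llama-3-8B")

trainer_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,
    num_train_epochs=25,
    logging_steps=1,
)

trainer = Trainer(model=model, args=trainer_args, train_dataset=train_dataset)  # train_dataset prepared elsewhere
trainer.train()
```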

This works for LoRA training, but I get OOM errors without LoRA, which makes sense since this is data parallelism and each process keeps a full copy of the model on its own GPU. I think the model should fit if it were actually distributed across all 4 GPUs. My current plan is to skip the Trainer class and specify the training routine and device placement manually.
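If I go that route, I am imagining something along these lines (completely untested sketch; names are placeholders, and I am assuming the hooks added by device_map='auto' move activations between GPUs during the forward pass):

```python
# Untested sketch of a manual loop with the model sharded across the 4 GPUs.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # stand-in checkpoint name
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
first_device = next(iter(model.hf_device_map.values()))             # device of the first shard
head_device = model.hf_device_map.get("qa_outputs", first_device)   # where the loss is computed (assumption)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # assumes train_dataset yields tensors

model.train()
for epoch in range(25):
    for batch in train_loader:
        # Labels go to the shard that computes the loss; everything else to the first shard.
        labels = {k: batch.pop(k).to(head_device) for k in ("start_positions", "end_positions")}
        batch = {k: v.to(first_device) for k, v in batch.items()}
        outputs = model(**batch, **labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```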