Saving unique weights while training on multiple GPU - Trainer

Hello,

I am training a LoRA adaptation of a T5 model in a single-machine, multi-GPU setup.
I am using Transformers 4.26.1 and DeepSpeed 0.9.2 and launching my script with deepspeed (so the parallelization setup is Distributed Data Parallel).

I am using a customized callback in the Trainer to save only the LoRA weights at each epoch. Unfortunately, since I am using multiple GPUs, the script is run in parallel four times (the number of GPUs I use), so at the end of each epoch the weights are saved four times. Is there anything I can do to save the weights only once?
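For reference, my callback looks roughly like the sketch below. This is a simplified version: the class name, the output path, and the way the LoRA parameters are selected (filtering on "lora" in the parameter names) are just illustrative, not exactly what I have in my script.

```python
import torch
from transformers import TrainerCallback


class SaveLoraCallback(TrainerCallback):
    """Save only the LoRA adapter weights at the end of each epoch."""

    def on_epoch_end(self, args, state, control, **kwargs):
        model = kwargs["model"]
        # Keep only the parameters belonging to the LoRA adapters
        # (assumes the adapter parameters have "lora" in their names).
        lora_state_dict = {
            name: param.detach().cpu()
            for name, param in model.named_parameters()
            if "lora" in name
        }
        # This runs on every process, so with 4 GPUs the file is written 4 times.
        torch.save(
            lora_state_dict,
            f"{args.output_dir}/lora_epoch_{int(state.epoch)}.bin",
        )
```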

Thanks for your help,

Lucius