Manual splitting of model across multi-GPU setup

I have been testing LoRA training and have a question that I can't find an answer to.
Here is my hardware setup:
Intel 3435X
128GB DDR5 in 8 channel
2x 3090 FE cards with NVLink
Dual boot Ubuntu/Windows

I use Ubuntu for development and training, with Oobabooga's text-generation-webui as the GUI and the Training PRO extension.
I am running tests training Xwin 70B via transformers with the following flags:

--load-in-4bit
--use_double_quant
--auto-devices

I can train the model at rank 64, alpha 128, max context length 45, batch size 1, and gradient accumulation 5. I target the q, k, and v projections and use a NEFTune noise scale of 2.

I can get decent results from these settings, but I would like more wiggle room to experiment. This is where my question comes in.

Edit to fix VRAM amounts.

During initial loading, both GPUs load fine. Before training they sit at about 17.3GB on GPU 0 and 20.3GB on GPU 1. Once I start training, usage rises to 21.25GB on GPU 0 and 24.178GB on GPU 1. If I try to adjust much of anything, I hit CUDA OOM errors. I am pretty sure this happens because of the memory imbalance that appears once training starts, as it tries to overfill GPU 1.

Is there a way that I can manually control the layer split between GPUs? Right now it appears as though it is using (# of layers)/(# of GPUs) to try and split the model evenly between the GPUs, without accounting for the overhead of the various code libraries that also have to be loaded to GPU. I’d like to be able to offload more layers to GPU 0 in order to take advantage of my unused VRAM.

If needed, I'm willing to try altering some of the Python code in the transformers package installed by Oobabooga.

Hi,

Yes, you can manually edit the device_map, which controls how the model is placed on the available devices.
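As a rough sketch of what a hand-written device_map could look like: the helper below (build_device_map is a hypothetical name, not part of transformers) maps module names to GPU indices with an uneven layer split. It assumes Llama-style module names (model.embed_tokens, model.layers.N, model.norm, lm_head) and 80 decoder layers, which is what I would expect for a Llama 2 70B based model like Xwin 70B; the 44/36 split is only an illustration to tune against your own VRAM readings.

```python
def build_device_map(num_layers, layers_on_gpu0):
    """Build a device_map dict that puts the first `layers_on_gpu0`
    decoder layers on GPU 0 and the remaining layers on GPU 1.

    Module names are ASSUMED to follow the Llama layout in transformers.
    """
    if not 0 <= layers_on_gpu0 <= num_layers:
        raise ValueError("layers_on_gpu0 must be between 0 and num_layers")

    # Token embeddings feed the first layer, so keep them on GPU 0.
    device_map = {"model.embed_tokens": 0}

    # Uneven split: GPU 0 gets more layers than a plain 50/50 split.
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = 0 if i < layers_on_gpu0 else 1

    # Final norm and output head follow the last layer on GPU 1.
    device_map["model.norm"] = 1
    device_map["lm_head"] = 1
    return device_map


# Illustrative numbers only: 80 layers, 44 on GPU 0 and 36 on GPU 1.
device_map = build_device_map(num_layers=80, layers_on_gpu0=44)
print(device_map["model.layers.43"], device_map["model.layers.44"])
```

The resulting dict can be passed as the device_map argument of from_pretrained in place of "auto". If loading complains about a module that is not covered, load once with "auto" and print model.hf_device_map to see the exact names your transformers version uses.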

See this page for more info: Handling big models for inference
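
That page also describes a max_memory argument, which may be less work than writing a full device_map: you cap each GPU and let the automatic placement fill them up to those caps. Here is a minimal sketch, assuming 24GiB cards; build_max_memory is a hypothetical helper and the headroom values are guesses to adjust, with more headroom left on GPU 1 because that is the card that fills up during training in the numbers above.

```python
def build_max_memory(total_gib, headroom_gib):
    """Return a max_memory dict ({gpu_index: "<N>GiB"}) that reserves
    `headroom_gib[gpu]` on each GPU for training overhead."""
    return {
        gpu: f"{total - headroom_gib[gpu]}GiB"
        for gpu, total in total_gib.items()
    }


# Assumed values: two 24GiB cards, 3GiB kept free on GPU 0, 7GiB on GPU 1.
max_memory = build_max_memory(
    total_gib={0: 24, 1: 24},
    headroom_gib={0: 3, 1: 7},
)
print(max_memory)
```

This dict would go to from_pretrained as max_memory=max_memory together with device_map="auto" (or "sequential", which fills GPU 0 first). Lowering the cap on GPU 1 should push more layers onto GPU 0.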