Model Parallelism and Pipelining for Model Training

Hi,

I am currently experimenting with training some small LLaMA models (e.g. LLaMA-MoE-3.5B (2/8)). Due to the model size, it cannot fit on a single RTX 3090 Ti GPU with 24GB of VRAM. Note also that, due to some requirements, I cannot use adapters to finetune the model; the entire model needs to be trainable.

While I can access multiple GPUs to train the model, the model parameters/components need to be split in a model-parallel or pipelined scheme. However, I have not found any tutorials, repositories, or libraries on HuggingFace that support splitting a single model over multiple GPUs. (There is the “Model Parallelism” page, but it only covers the higher-level concepts of how to split the model, not the actual implementation.)

Does anyone have any suggestions on how model pipelining or parallelism, where a model is split across several GPUs, could be conducted?

Thanks in advance.

For multi-GPU training there are frameworks like Accelerate, DeepSpeed, and torchrun that can do this:
Efficient Training on Multiple GPUs (huggingface.co)
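
For instance, here is a minimal sketch of the “naive” model-parallel route built into Transformers/Accelerate: loading with `device_map="auto"` shards the layers across all visible GPUs, and the Trainer can then finetune the full model. The checkpoint name, dataset, and batch sizes below are placeholders, not tuned values:

```python
# Minimal sketch: split one model across the visible GPUs via device_map="auto"
# (requires the `accelerate` package). Checkpoint name and dataset are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "your-org/your-llama-moe-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" assigns blocks of layers to each GPU based on free memory,
# so the full model is trainable without fitting on any single card.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: your tokenized dataset
)
trainer.train()
```

Note that this is sequential model parallelism: only one GPU computes at a time, so it addresses the memory problem rather than throughput.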

Alternatively, you could consider single-GPU optimisations:
Methods and tools for efficient training on a single GPU (huggingface.co)

With these optimisations I find it is possible to work with models such as Mistral 7B on a 16GB card (Colab environment).
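
To make that concrete, this is roughly the kind of thing I mean (gradient checkpointing, mixed precision, gradient accumulation); the values are illustrative only:

```python
# Illustrative single-GPU memory optimisations; the values are placeholders, not tuned.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,    # keep the per-step activation footprint small
    gradient_accumulation_steps=16,   # recover a larger effective batch size
    gradient_checkpointing=True,      # trade extra compute for less activation memory
    bf16=True,                        # mixed precision (use fp16=True on cards without bf16 support)
)
```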

Unfortunately, due to certain requirements (model size, available GPUs, intended use, etc.), we can only conduct multi-GPU training through model parallelism or pipelining.

To further explain the intended scenario, assume that I have a tweaked model with 4 layers:

Input → Layer1 → Layer2 → Layer3 → Layer4 → Output.

Now if I have 4 GPUs, is there some way to manually assign each layer to a particular GPU? For example, GPU1 should only load and compute Layer1, and then pass the output to GPU2 for computation by Layer2.
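
To be concrete, what I have in mind is something like the following sketch in plain PyTorch, where each stage is pinned to its own device and the activations are moved between GPUs in `forward` (the `nn.Linear` layers are just stand-ins for my actual layers):

```python
# Sketch of what I have in mind: each layer pinned to its own GPU, activations
# moved between devices in forward(). nn.Linear is a stand-in for my real layers.
import torch
import torch.nn as nn

class FourStageModel(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.layer1 = nn.Linear(hidden, hidden).to("cuda:0")
        self.layer2 = nn.Linear(hidden, hidden).to("cuda:1")
        self.layer3 = nn.Linear(hidden, hidden).to("cuda:2")
        self.layer4 = nn.Linear(hidden, hidden).to("cuda:3")

    def forward(self, x):
        x = self.layer1(x.to("cuda:0"))
        x = self.layer2(x.to("cuda:1"))   # hand the activation to the next GPU
        x = self.layer3(x.to("cuda:2"))
        x = self.layer4(x.to("cuda:3"))
        return x

model = FourStageModel()
out = model(torch.randn(2, 1024))
print(out.device)   # the output lives on cuda:3
```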

Efficient Training on Multiple GPUs (huggingface.co)

Pipeline Parallelism looks like it fits your problem.
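
If you want to go the pipeline route rather than the naive split, DeepSpeed's pipeline engine is one concrete option: you express the model as an ordered list of layers and DeepSpeed partitions it into stages and feeds micro-batches through them. A rough sketch, in which the layers, config values, and data iterator are placeholders (launch it with the `deepspeed` launcher so one process runs per GPU):

```python
# Rough sketch of pipeline parallelism with DeepSpeed. The layers, config values
# and data iterator are placeholders; launch with the `deepspeed` launcher.
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

# The model is expressed as an ordered list of layers; DeepSpeed partitions the
# list into `num_stages` pipeline stages, one per GPU.
layers = [nn.Linear(1024, 1024) for _ in range(4)]   # stand-ins for Layer1..Layer4
model = PipelineModule(layers=layers, num_stages=4, loss_fn=nn.MSELoss())

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,     # number of micro-batches kept in flight
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)

# train_iter should yield (input, label) tuples; each call runs forward, backward
# and the optimizer step across all pipeline stages.
# loss = engine.train_batch(data_iter=train_iter)
```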