I am currently experimenting with training some small LLaMA models (e.g. LLaMA-MoE-3.5B (2/8)). Because of the model size, it does not fit on a single RTX 3090 Ti GPU with 24 GB of VRAM for full training. Note also that, due to some requirements, I cannot use adapters to fine-tune the model; the entire model needs to be trainable.
While I can access multiple GPUs to train the model, the model's parameters/components need to be split across them in a model-parallel or pipelined scheme. However, I have not found any tutorials, repositories, or libraries on HuggingFace that support splitting a single model over multiple GPUs. (There is the "Model Parallelism" page, but it only covers the higher-level concepts of how to split a model, not the actual implementation.)
Does anyone have any suggestions on how model pipelining or model parallelism, where a single model is split across several GPUs, could be carried out?
Unfortunately, due to certain requirements (model size, available GPUs, intended use, etc.), we can only conduct multi-GPU training through model parallelism or pipelining.
To further explain the intended scenario, assume that I have a tweaked model with 4 layers:
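Something along these lines (the layer types and sizes here are just placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """A stand-in for the tweaked model: four sequential layers."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.layer1 = nn.Linear(hidden_size, hidden_size)
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.layer3 = nn.Linear(hidden_size, hidden_size)
        self.layer4 = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        return self.layer4(x)
```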
Now if I have 4 GPUs, is there some way to manually assign each layer to a particular GPU? For example, GPU1 should only load and compute Layer1, and then pass the output to GPU2 for computation by Layer2.
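Conceptually, the naive version I can imagine looks like the sketch below, building on the toy model above (the device indices are placeholders, and I am not sure whether this approach plays nicely with the HF Trainer or whether it is the recommended way to do it):

```python
class ManuallyShardedModel(ToyModel):
    """Same toy model, but each layer is placed on its own GPU and the
    activations are moved between devices inside forward()."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__(hidden_size)
        # Hypothetical one-layer-per-GPU placement.
        self.layer1.to("cuda:0")
        self.layer2.to("cuda:1")
        self.layer3.to("cuda:2")
        self.layer4.to("cuda:3")

    def forward(self, x):
        # Move the activations to each layer's device before computing.
        x = self.layer1(x.to("cuda:0"))
        x = self.layer2(x.to("cuda:1"))
        x = self.layer3(x.to("cuda:2"))
        return self.layer4(x.to("cuda:3"))
```

With this naive scheme only one GPU is busy at a time, which is why I am asking whether there is an existing library or utility that does proper pipelining instead.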