Model Parallelism

Hi, I want to pre-train LLaMA-7B on my multi-GPU system. However, because of the model's size, I run into OOM errors. I have already read several methods in this forum and in the HuggingFace documentation, but they all use from_pretrained, which does not apply to my experiment since I have to train from scratch.
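
For reference, this is roughly how I instantiate the model from scratch (the config values are my guess at the 7B architecture, not something I have verified against the paper):

```python
# Rough sketch: build LLaMA from a config with random init, no from_pretrained.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=4096,
    intermediate_size=11008,
    num_hidden_layers=32,
    num_attention_heads=32,
)  # assumed 7B hyperparameters
model = LlamaForCausalLM(config)  # randomly initialized, ready for pre-training
```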

What I want is to split the model itself (not DDP or DP) and place each layer on a different GPU, i.e., model parallelism, along the lines of the toy sketch below. Is there any way or recommended solution to do this? Thank you.
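
To illustrate what I mean, here is a minimal toy sketch of a layer-wise split in plain PyTorch (simple Linear layers standing in for the LLaMA decoder blocks, and two GPUs assumed):

```python
import torch
import torch.nn as nn

class NaiveModelParallel(nn.Module):
    """Toy layer-wise split: first half of the layers on cuda:0,
    second half on cuda:1, activations moved between devices in forward()."""
    def __init__(self, hidden=4096, layers_per_gpu=2):
        super().__init__()
        self.part0 = nn.Sequential(
            *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]
        ).to("cuda:0")
        self.part1 = nn.Sequential(
            *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]
        ).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))
        x = self.part1(x.to("cuda:1"))  # move activations to the next GPU
        return x

model = NaiveModelParallel()
out = model(torch.randn(8, 4096))
out.sum().backward()  # autograd handles gradients across the two devices
```

I am looking for a way to do this (or something better, e.g. pipeline parallelism) for the full LLaMA-7B architecture trained from scratch.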