How to minimize memory consumption when loading pretrained models?

I’ve recently been working with several 7B-scale LLMs. When I load LLaMA2-7B through AutoModelForCausalLM, memory usage is well controlled, around 28GB. However, when I load MPT-7B, which has slightly fewer parameters, it costs 31GB, and I can’t load other LLMs like Falcon-7B at all, since I only have 32GB of system RAM. I’m curious why loading LLaMA2 takes only around 28GB of memory, and how I can load Falcon within a 32GB memory cap.
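
For reference, here is a minimal sketch of how I load each model (the Hub IDs and the `trust_remote_code` flag are just my assumptions about the standard setup; I’m not overriding the dtype, so everything comes in as fp32 on CPU):

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed Hub IDs; I swap between these three models.
model_name = "meta-llama/Llama-2-7b-hf"  # also "mosaicml/mpt-7b" and "tiiuae/falcon-7b"

# No torch_dtype or low_cpu_mem_usage arguments passed,
# so weights are loaded in float32 by default.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # MPT and Falcon ship custom modeling code
)
```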