LLaMA 7B GPU Memory Requirement

Hi @Forbu14,

in full precision (float32), every parameter of the model is stored in 32 bits, or 4 bytes. Hence 4 bytes/parameter * 7 billion parameters = 28 billion bytes = 28 GB of GPU memory required, for inference only. In half precision, each parameter is stored in 16 bits, or 2 bytes, so you would need 14 GB for inference. There are now also 8-bit and 4-bit quantization algorithms, so with 4 bits (or half a byte) per parameter you would need only 3.5 GB of memory for inference. However, there is usually some additional overhead as you generate tokens (most notably the KV cache); see this nice blog post: Calculating GPU memory for serving LLMs | Substratus.AI.
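To make the arithmetic concrete, here's a minimal sketch of the weights-only estimate (the 7B parameter count and per-precision byte sizes are the assumptions from above; KV cache and activation overhead are not included):

```python
# Rough inference memory estimate: model weights only, no KV cache or activations.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}

def inference_memory_gb(num_params: float, precision: str) -> float:
    """Memory in GB needed just to hold the weights at a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {inference_memory_gb(7e9, precision):.1f} GB")
# float32: 28.0 GB, float16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```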

For training, it depends on the optimizer you use and on whether you use full fine-tuning or a parameter-efficient method (PEFT) such as QLoRA.

In case you use regular AdamW, you need 8 bytes per parameter for the optimizer states alone, as it keeps two extra values per parameter: exponential moving averages of the gradients and of the squared gradients (the first and second moments), each in float32. Hence, for a 7B model that is 8 bytes/parameter * 7 billion parameters = 56 GB of GPU memory, on top of the weights and gradients themselves. If you use Adafactor, you need roughly 4 bytes per parameter, or 28 GB of GPU memory. With the optimizers of bitsandbytes (like 8-bit AdamW), you would need only 2 bytes per parameter, or 14 GB of GPU memory.
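As a back-of-the-envelope sketch (the per-parameter byte counts are the ones quoted above, not exact measurements):

```python
# Optimizer-state memory per parameter, on top of weights and gradients.
OPTIMIZER_BYTES_PER_PARAM = {
    "AdamW (fp32 states)": 8,         # two fp32 moments: 4 + 4 bytes
    "Adafactor": 4,                   # factored second moment, roughly 4 bytes
    "AdamW 8-bit (bitsandbytes)": 2,  # two quantized 8-bit moments
}

num_params = 7e9
for name, bytes_per_param in OPTIMIZER_BYTES_PER_PARAM.items():
    print(f"{name}: {num_params * bytes_per_param / 1e9:.0f} GB")
# AdamW: 56 GB, Adafactor: 28 GB, AdamW 8-bit: 14 GB
```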

In case you use parameter-efficient methods like QLoRA, memory requirements are greatly reduced: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA. Basically, one quantizes the base model in 8 or 4 bits and then trains adapters on top in float16.
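A minimal QLoRA-style setup with transformers + peft + bitsandbytes might look like the sketch below (the model id, LoRA rank, and target modules are illustrative assumptions, not recommendations from the linked post):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4-bit (NF4) on load.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Hypothetical checkpoint; substitute the one you actually use.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Train only small low-rank adapters in half precision on top.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```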

I highly recommend this guide: Methods and tools for efficient training on a single GPU, which goes over all of this in much more detail.
