Hi, I wanted to play with the recently released LLaMA 7B model. With the code below I got an OOM error on a T4 16 GB GPU.
How much GPU memory do I need to run the 7B model? In the Meta FAIR version of the model, we can adjust the max batch size to make it work on a single T4. What should be done here to make it work on a single T4 GPU? Thanks!

tokenizer = transformers.LlamaTokenizer.from_pretrained("/path/to/tokenizer/")
model = transformers.LlamaForSequenceClassification.from_pretrained("/path/to/llama-7b/")

To run the 7B model in full precision, you need 7 * 4 = 28 GB of GPU RAM. You should pass torch_dtype=torch.float16 to from_pretrained to use half the memory and fit the model on a T4.
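As a sketch, the arithmetic and the half-precision load look like this (the paths are the placeholders from the original post, and the load is guarded so it only runs if a checkpoint actually exists there):

```python
import os

# Weight-only footprint: parameters (in billions) * bytes per parameter.
FP32_GB = 7 * 4  # 28 GB in full precision, too big for a 16 GB T4
FP16_GB = 7 * 2  # 14 GB in half precision, which fits

model_path = "/path/to/llama-7b/"  # placeholder path from the post above
if os.path.isdir(model_path):
    import torch
    import transformers

    model = transformers.LlamaForSequenceClassification.from_pretrained(
        model_path,
        torch_dtype=torch.float16,  # store weights in fp16 instead of the default fp32
    )
```

Note this estimate covers the weights only; activations and the KV cache need additional headroom on top.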

13 * 4 = 52 GB – this is the memory requirement for inference with a 13B model in full precision. For training you usually need more memory, depending on tensor parallelism, pipeline parallelism, the optimizer, ZeRO offloading parameters, the framework, and so on. Contact me: https://www.linkedin.com/in/denistimonin/

Basically, the idea is that you store the raw weights (weights are stored in 16-bit format) and you also need to store the gradient of the weights. As 1 byte = 8 bits, you need 2 bytes for every weight and another 2 bytes for its gradient. And that's only the case if you use SGD, because if you use Adam as your optimizer, you need more memory per weight.
So you end up with a raw memory requirement of 4 * nb_parameters bytes if you use SGD.
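That estimate can be written down directly (a back-of-the-envelope figure for weights and gradients only, ignoring activations and framework overhead):

```python
def sgd_fp16_training_gb(n_params_billion: float) -> float:
    """2 bytes per fp16 weight + 2 bytes per fp16 gradient = 4 bytes/param."""
    return n_params_billion * (2 + 2)

# 7B model with plain SGD: 28 GB just for weights and gradients.
print(sgd_fp16_training_gb(7))  # -> 28
```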

You can read the LoRA paper: https://arxiv.org/pdf/2106.09685.pdf. At the beginning they say that using LoRA for fine-tuning reduces GPU memory usage by up to 3x, because you don't have to store the gradients and the gradient momentum of the optimizer for the frozen weights.

In full precision (float32), every parameter of the model is stored in 32 bits, or 4 bytes. Hence 4 bytes/parameter * 7 billion parameters = 28 billion bytes = 28 GB of GPU memory required, for inference only. In half precision, each parameter is stored in 16 bits, or 2 bytes, so you would need 14 GB for inference. There are now also 8-bit and 4-bit quantization algorithms, so with 4 bits (or half a byte) per parameter you would need only 3.5 GB of memory for inference.
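All of these per-precision numbers follow from one formula (again, a rough weight-storage estimate only, not activations or KV cache):

```python
def inference_weight_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate GB needed just to hold the weights at a given precision."""
    return n_params_billion * bits_per_param / 8

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {inference_weight_gb(7, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```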

For training, it depends on the optimizer you use.

In case you use regular AdamW, then you need 8 bytes per parameter (as it not only stores the parameters, but also their gradients and the first and second moment estimates of the gradients). Hence, for a 7B model you would need 8 bytes/parameter * 7 billion parameters = 56 GB of GPU memory. If you use AdaFactor, then you need 4 bytes per parameter, or 28 GB of GPU memory. With the optimizers of bitsandbytes (like 8-bit AdamW), you would need only 2 bytes per parameter, or 14 GB of GPU memory.
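Putting the three optimizer figures side by side (the bytes-per-parameter values are the ones from the paragraph above, so this is the same rough estimate in code form):

```python
# Approximate per-parameter training cost for each optimizer, in bytes.
BYTES_PER_PARAM = {
    "AdamW": 8,              # params + gradients + running moment estimates
    "AdaFactor": 4,          # factored second moments shrink the state
    "8-bit AdamW (bnb)": 2,  # bitsandbytes quantizes optimizer state to 8 bits
}

def training_gb(n_params_billion: float, optimizer: str) -> float:
    return n_params_billion * BYTES_PER_PARAM[optimizer]

for name in BYTES_PER_PARAM:
    print(f"{name}: {training_gb(7, name)} GB")
# AdamW: 56 GB, AdaFactor: 28 GB, 8-bit AdamW (bnb): 14 GB
```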

Thanks much. This is very useful! I'm curious to learn more about bitsandbytes – e.g. AdamW 8-bit to get it working with 14 GB. Does anyone have the model on HF by using the last optimizer you mention?
–Aaron

Is your answer assuming a batch size of 1? In other words, how does the memory requirement change with the batch size? I think the number of parameters will remain the same, so we will not need additional memory to store them – the extra memory will be needed to store a bigger batch.
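For intuition: the weight memory is fixed, while activation memory grows roughly linearly with batch size. A toy estimate (the 0.5 GB per-sample activation figure here is purely hypothetical and varies with sequence length and architecture):

```python
def total_gb(weight_gb: float, activation_gb_per_sample: float, batch_size: int) -> float:
    """Weights are batch-independent; activations scale with the batch."""
    return weight_gb + activation_gb_per_sample * batch_size

# Hypothetical numbers: 14 GB of fp16 weights, ~0.5 GB of activations per sample.
print(total_gb(14.0, 0.5, 1))  # -> 14.5
print(total_gb(14.0, 0.5, 8))  # -> 18.0
```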