Hi, I wanted to play with the recently released LLaMA 7B model. With the command below I get an OOM error on a T4 16GB GPU.
How much GPU memory do I need to run the 7B model? In the Meta FAIR version of the model, we can adjust the max batch size to make it work on a single T4. What should be done here to make it work on a single T4 GPU? Thanks!
import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/path/to/tokenizer/")
model = transformers.LlamaForSequenceClassification.from_pretrained("/path/to/llama-7b/")
To run the 7B model in full precision, you need 7 * 4 = 28 GB of GPU RAM. You should pass
torch_dtype=torch.float16 to from_pretrained to use half the memory and fit the model on a T4.
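For example (a minimal sketch based on the snippet above; the paths are placeholders from the original post):

import torch
import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/path/to/tokenizer/")

# float16 halves the per-parameter storage: ~14 GB instead of ~28 GB, which fits on a 16 GB T4
model = transformers.LlamaForSequenceClassification.from_pretrained(
    "/path/to/llama-7b/",
    torch_dtype=torch.float16,
)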
How much would 13B take, 13*4 = 52 GB?
We are getting a CUDA OOM error while fine-tuning a 13B LLaMA model on a 4xA100 cluster. What might we be doing wrong?
13 * 4 = 52 GB is the memory requirement for inference only. For training, you usually need more memory, depending on tensor parallelism, pipeline parallelism, the optimizer, ZeRO offloading parameters, the framework, and other factors. Contact me: https://www.linkedin.com/in/denistimonin/
@sgugger what is the reasoning behind needing 7 * 4 = 28 GB?
Or, what resource would you consult to gain this insight?
Basically, the idea is that you store the raw weights (weights are stored in 16-bit format) and you also need to store the gradient of the weights. As 1 byte = 8 bits, you need 2 B for every weight and another 2 B for its gradient. And that's only the case if you use SGD optimization, because if you use Adam as your optimizer, you need more memory per weight.
So you end up with a raw memory requirement of 4 * nb_parameters bytes if you use SGD.
You can read the LoRA paper: https://arxiv.org/pdf/2106.09685.pdf. At the beginning, they say that using LoRA for fine-tuning cuts the memory requirement by roughly 3x, because you don't have to store the gradients and gradient momentum of the optimizer for the frozen weights.
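As a quick sanity check, here is that accounting written out (a back-of-the-envelope sketch; it ignores activations and framework overhead):

nb_parameters = 7e9        # LLaMA 7B

# fp16 weights: 2 bytes per parameter, plus 2 bytes for each gradient (SGD, no momentum)
bytes_per_parameter = 2 + 2

print(nb_parameters * bytes_per_parameter / 1e9)  # 28.0 GB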
In full precision (float32), every parameter of the model is stored in 32 bits, or 4 bytes. Hence 4 bytes/parameter * 7 billion parameters = 28 billion bytes = 28 GB of GPU memory required, for inference only. In half precision, each parameter would be stored in 16 bits, or 2 bytes, so you would need 14 GB for inference. There are now also 8-bit and 4-bit algorithms, so with 4 bits (half a byte) per parameter you would need 3.5 GB of memory for inference.
For training, it depends on the optimizer you use.
In case you use regular AdamW, then you need 8 bytes per parameter (as it not only stores the parameters, but also their gradients and running first- and second-moment estimates). Hence, for a 7B model you would need 8 bytes per parameter * 7 billion parameters = 56 GB of GPU memory. If you use AdaFactor, then you need 4 bytes per parameter, or 28 GB of GPU memory. With the optimizers of bitsandbytes (like 8-bit AdamW), you would need 2 bytes per parameter, or 14 GB of GPU memory.
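To make that arithmetic easy to reuse, here is a small helper that encodes those per-parameter costs (an illustrative sketch; activations, the CUDA context, and framework overhead come on top of these numbers):

def estimate_memory_gb(n_params_billion, bytes_per_param):
    """Rough GPU memory estimate from the per-parameter cost; ignores activations and overhead."""
    return n_params_billion * bytes_per_param

# Inference: fp32 = 4 bytes, fp16 = 2, 8-bit = 1, 4-bit = 0.5
print(estimate_memory_gb(7, 4))    # 28 GB, full precision
print(estimate_memory_gb(7, 0.5))  # 3.5 GB, 4-bit

# Training: AdamW = 8 bytes, AdaFactor = 4, 8-bit AdamW = 2
print(estimate_memory_gb(7, 8))    # 56 GB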
I highly recommend this guide: Efficient Training on a Single GPU, which goes over all of this in much more detail.
Thanks much. This is very useful! I'm curious to learn more about bitsandbytes, e.g. 8-bit AdamW to get it working with 14 GB. Does anyone have the model on HF trained using the last optimizer you mention?
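For concreteness, I assume swapping in the 8-bit optimizer looks roughly like this (a sketch only; it assumes bitsandbytes is installed, and model and the learning rate are placeholders):

import bitsandbytes as bnb

# Drop-in replacement for torch.optim.AdamW; optimizer state is kept in 8 bit
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)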
Thank you for your explanation.
Is your answer assuming a batch size of 1? In other words, how does the memory requirement change with the batch size? I think the number of parameters will remain the same, so we will not need additional memory to store them; the extra memory will be needed for the activations of a bigger batch.
The weights provided by Meta (non-HF) are about 13 GB in size, and they run as-is on a GPU with 16 GB of VRAM. Why is there such a large difference in the sizes?
Any experience in running LLaMA-7B on an RTX 3060?
I have fine-tuned Llama 2 7B on Kaggle (30 GB VRAM) with LoRA, but I am unable to merge the adapter weights with the model. How much RAM does merging take?
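For reference, this is roughly the merge I'm attempting (a sketch with placeholder paths, using the PEFT merge_and_unload API):

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the base model on CPU in fp16, then attach the LoRA adapter
base = LlamaForCausalLM.from_pretrained("/path/to/llama-2-7b/", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "/path/to/lora-adapter/")

# Fold the adapter weights into the base weights and save the result
merged = model.merge_and_unload()
merged.save_pretrained("/path/to/merged-model/")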