Hi, I wanted to play with the recently released LLaMA 7B model. With the command below I get an OOM error on a T4 16GB GPU.
How much GPU memory do I need to run the 7B model? In the Meta FAIR version of the model, we can adjust the max batch size to make it work on a single T4. What should be done here to make it work on a single T4 GPU? Thanks!
import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/path/to/tokenizer/")
model = transformers.LlamaForSequenceClassification.from_pretrained("/path/to/llama-7b/")
To run the 7B model in full precision, you need 7 * 4 = 28 GB of GPU RAM. You should pass
torch_dtype=torch.float16 to from_pretrained to use half the memory and fit the model on a T4.
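For example (a minimal sketch based on the snippet above; the paths are placeholders from the original post):

import torch
import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/path/to/tokenizer/")

# float16 halves the per-parameter storage: ~14 GB instead of ~28 GB, which fits on a 16 GB T4
model = transformers.LlamaForSequenceClassification.from_pretrained(
    "/path/to/llama-7b/",
    torch_dtype=torch.float16,
)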
How much would 13B take, 13*4 = 52 GB?
We are getting a CUDA OOM error while fine-tuning a 13B LLaMA model on a 4xA100 cluster. What might we be doing wrong?
13 * 4 = 52 GB is the memory requirement for inference only. For training, you usually need more memory, depending on tensor parallelism, pipeline parallelism, the optimizer, ZeRO offloading parameters, the framework, and other factors. Contact me: https://www.linkedin.com/in/denistimonin/
@sgugger what is the reasoning behind needing 7 * 4 = 28 GB?
Or, what resource would you consult to gain this insight?
Basically, the idea is that you store the raw weights (weights are stored in 16-bit format) and you also need to store the gradient of the weights. As 1 byte = 8 bits, you need 2 B for every weight and another 2 B for its gradient. And that's only the case if you use SGD optimization, because if you use Adam as your optimizer, you need more memory per weight.
So you end up with a raw memory requirement of 4 * nb_parameters bytes if you use SGD.
You can read the LoRA paper: https://arxiv.org/pdf/2106.09685.pdf. At the beginning, they say that using LoRA for fine-tuning cuts the memory requirement by roughly 3x, because you don't have to store the gradients and gradient momentum of the optimizer for the frozen weights.
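As a quick sanity check, here is that accounting written out (a back-of-the-envelope sketch; it ignores activations and framework overhead):

nb_parameters = 7e9        # LLaMA 7B

# fp16 weights: 2 bytes per parameter, plus 2 bytes for each gradient (SGD, no momentum)
bytes_per_parameter = 2 + 2

print(nb_parameters * bytes_per_parameter / 1e9)  # 28.0 GB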
In full precision (float32), every parameter of the model is stored in 32 bits, or 4 bytes. Hence 4 bytes/parameter * 7 billion parameters = 28 billion bytes = 28 GB of GPU memory required, for inference only. In half precision, each parameter would be stored in 16 bits, or 2 bytes, so you would need 14 GB for inference. There are now also 8-bit and 4-bit algorithms, so with 4 bits (half a byte) per parameter you would need 3.5 GB of memory for inference.
For training, it depends on the optimizer you use.
In case you use regular AdamW, then you need 8 bytes per parameter (as it not only stores the parameters, but also their gradients and running first- and second-moment estimates). Hence, for a 7B model you would need 8 bytes per parameter * 7 billion parameters = 56 GB of GPU memory. If you use AdaFactor, then you need 4 bytes per parameter, or 28 GB of GPU memory. With the optimizers of bitsandbytes (like 8-bit AdamW), you would need 2 bytes per parameter, or 14 GB of GPU memory.
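To make that arithmetic easy to reuse, here is a small helper that encodes those per-parameter costs (an illustrative sketch; activations, the CUDA context, and framework overhead come on top of these numbers):

def estimate_memory_gb(n_params_billion, bytes_per_param):
    """Rough GPU memory estimate from the per-parameter cost; ignores activations and overhead."""
    return n_params_billion * bytes_per_param

# Inference: fp32 = 4 bytes, fp16 = 2, 8-bit = 1, 4-bit = 0.5
print(estimate_memory_gb(7, 4))    # 28 GB, full precision
print(estimate_memory_gb(7, 0.5))  # 3.5 GB, 4-bit

# Training: AdamW = 8 bytes, AdaFactor = 4, 8-bit AdamW = 2
print(estimate_memory_gb(7, 8))    # 56 GB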
I highly recommend this guide: Efficient Training on a Single GPU, which goes over all of this in much more detail.
Thanks much. This is very useful! I'm curious to learn more about bitsandbytes, e.g. 8-bit AdamW to get it working with 14 GB. Does anyone have the model on HF trained using the last optimizer you mention?
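For concreteness, I assume swapping in the 8-bit optimizer looks roughly like this (a sketch only; it assumes bitsandbytes is installed, and model and the learning rate are placeholders):

import bitsandbytes as bnb

# Drop-in replacement for torch.optim.AdamW; optimizer state is kept in 8 bit
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)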
Thank you for your explanation.
Is your answer assuming a batch size of 1? In other words, how does the memory requirement change with the batch size? I think the number of parameters will remain the same, so we will not need additional memory to store them; the extra memory will be needed for the activations of a bigger batch.
The weights provided by Meta (non-HF) are about 13 GB in size, and they run as-is on a GPU with 16 GB of VRAM. Why is there such a large difference in the sizes?
Any experience in running LLaMA-7B on an RTX 3060?
I have fine-tuned Llama 2 7B on Kaggle (30 GB VRAM) with LoRA, but I am unable to merge the adapter weights with the model. How much RAM does merging take?
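For reference, this is roughly the merge I'm attempting (a sketch with placeholder paths, using the PEFT merge_and_unload API):

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the base model on CPU in fp16, then attach the LoRA adapter
base = LlamaForCausalLM.from_pretrained("/path/to/llama-2-7b/", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "/path/to/lora-adapter/")

# Fold the adapter weights into the base weights and save the result
merged = model.merge_and_unload()
merged.save_pretrained("/path/to/merged-model/")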