LLaMA 7B GPU Memory Requirement

Hi, I wanted to play with the recently released LLaMA 7B model. With the code below I got an OOM error on a T4 16 GB GPU.
How much GPU memory do I need to run the 7B model? In the Meta FAIR version of the model, we can adjust the max batch size to make it work on a single T4. What should be done here to make it work on a single T4 GPU? Thanks!

import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/path/to/tokenizer/")
model = transformers.LlamaForSequenceClassification.from_pretrained("/path/to/llama-7b/")
1 Like

To run the 7B model in full precision, you need 7 * 4 = 28GB of GPU RAM. You should add torch_dtype=torch.float16 to use half the memory and fit the model on a T4.
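
For example, a minimal sketch based on the snippet above (paths are placeholders):

import torch
import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/path/to/tokenizer/")
# float16 halves the weight memory: ~14 GB instead of ~28 GB for the 7B model
model = transformers.LlamaForSequenceClassification.from_pretrained(
    "/path/to/llama-7b/", torch_dtype=torch.float16
)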

12 Likes

How much would 13B take, 13*4 = 52 GB?

We are getting a CUDA OOM error while fine-tuning a 13B Llama model on a 4xA100 cluster. What might we be doing wrong?

13 * 4 = 52 GB is the memory requirement for inference. For training, you usually need more memory (depending on tensor parallelism, pipeline parallelism, the optimizer, ZeRO offloading parameters, the framework, and so on). Contact me: https://www.linkedin.com/in/denistimonin/

1 Like

@sgugger what is the reasoning behind needing 7 * 4 = 28 GB?

Or, what resource would you consult to gain this insight?

Basically, the idea is that you store the raw weights (weights are stored in 16-bit format) and you also need to store the gradient of each weight. As 1 byte = 8 bits, you need 2 B for every weight and another 2 B for its gradient. And that's only the case if you use SGD optimization, because if you use Adam as your optimizer, you need more memory per weight.
So you end up with a raw memory requirement of 4 * nb_parameters bytes if you use SGD.
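
As a quick sanity check of that rule of thumb (just the arithmetic; activations and framework overhead come on top):

n_params = 7e9                            # LLaMA 7B
bytes_per_param = 2 + 2                   # fp16 weights + fp16 gradients (plain SGD, no momentum)
print(n_params * bytes_per_param / 1e9)   # ~28 GB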

1 Like

You can read the LoRA paper: https://arxiv.org/pdf/2106.09685.pdf. At the beginning, they explain that LoRA reduces the memory needed for fine-tuning by roughly a factor of 3, because you don't have to store the gradients and the optimizer's gradient momentum for the frozen base weights.

Hi @Forbu14,

in full precision (float32), every parameter of the model is stored in 32 bits or 4 bytes. Hence 4 bytes / parameter * 7 billion parameters = 28 billion bytes = 28 GB of GPU memory required, for inference only. In half precision, each parameter would be stored in 16 bits, or 2 bytes. Hence you would need 14 GB for inference. There are now also 8-bit and 4-bit algorithms, so with 4 bits (or half a byte) per parameter you would need 3.5 GB of memory for inference. However, usually there's also some additional overhead as you generate tokens; see this nice blog post: Calculating GPU memory for serving LLMs | Substratus.AI.
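
Put differently, here is the same arithmetic as a tiny helper (a sketch only; real serving adds overhead for activations and the KV cache):

def weight_memory_gb(n_params, bits_per_param):
    """Back-of-the-envelope weight memory, excluding activations and the KV cache."""
    return n_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits)} GB")   # 28.0, 14.0, 7.0, 3.5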

For training, it depends on the optimizer you use and whether you use full fine-tuning vs. PEFT (e.g. QLoRa).

In case you use regular AdamW, then you need 8 bytes per parameter (as it stores not only the parameters, but also the gradients and the first and second moments of the gradients). Hence, for a 7B model you would need 8 bytes per parameter * 7 billion parameters = 56 GB of GPU memory. If you use AdaFactor, then you need 4 bytes per parameter, or 28 GB of GPU memory. With the optimizers of bitsandbytes (like 8-bit AdamW), you would need 2 bytes per parameter, or 14 GB of GPU memory.
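
If you want to try the 8-bit optimizer route, a minimal sketch with bitsandbytes (assuming bitsandbytes is installed, `model` is an already loaded model, and the learning rate is just a placeholder; the exact class name may vary by version):

import bitsandbytes as bnb

# 8-bit AdamW keeps the optimizer states in int8, roughly 2 bytes per parameter
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)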

In case you use parameter-efficient methods like QLoRa, memory requirements are greatly reduced: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA. Basically, one quantizes the base model in 8 or 4 bits and then trains adapters on top in float16.
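
A minimal QLoRA-style sketch with transformers + peft (the model id, target modules, and hyperparameters are illustrative assumptions, not recommendations):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4 bits
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Train small float16 LoRA adapters on top of the quantized base
lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()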

I highly recommend this guide: Methods and tools for efficient training on a single GPU which goes over all of this in much more detail.

46 Likes

Thanks much. This is very useful! I'm curious to learn more about bitsandbytes - e.g. 8-bit AdamW to get it working with 14 GB. Does anyone have a model on the HF Hub trained with the last optimizer you mention?
–Aaron

@nielsr
Thank you for your explanation.

Is your answer assuming a batch size of 1? In other words, how does the memory requirement change with the batch size? I think the number of parameters will remain the same, so we will not need additional memory to store them – the extra memory will be needed to store a bigger batch.

3 Likes

Hi,
The weights provided by Meta (non-HF) are about 13 GB in size, and they run as-is on 16 GB of VRAM. Why is there such a large difference in the sizes?

2 Likes

Bonjour Sylvain
Any experience running LLaMA-7B on an RTX 3060?
Thanks!
Alexis

I have fine-tuned Llama 2 7B on Kaggle (30 GB VRAM) with LoRA, but I am unable to merge the adapter weights with the model. How much RAM does merging take?
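
(For reference, merging with PEFT typically looks roughly like the sketch below; the model id and paths are placeholders. Merging materializes the full float16 base model, i.e. roughly 14 GB of RAM for a 7B model, plus the adapter.)

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in float16 on CPU (~14 GB for 7B)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
# Attach the trained LoRA adapter, merge it into the base weights, then save
model = PeftModel.from_pretrained(base, "/path/to/lora-adapter/")
merged = model.merge_and_unload()
merged.save_pretrained("/path/to/merged-model/")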

Hey, during training we require 56 GB for the parameters and gradients. However, there will be some additional memory required for the optimizer states. How much would that be? And what are the optimizer states, basically?

I would try it out on Inference Endpoints (AWS) first with the 1x Nvidia A10G card, which has 24 GB of RAM. Most models that size require an A10. If that doesn't work, your next option is an A100, which is quite a bit more expensive.

I run Llama 7B on an A10 and it seems the perfect fit. The rate is $1.30/h while running, and if you set the KEDA (Kubernetes Event-Driven Autoscaler) setting to sleep after 15 minutes of non-use, you can minimize cost at the expense of about a 1-minute spin-up time. This can be a bit of a drag, but it's the best way to be responsible about cost and resource usage.

Good luck! Let us know how it goes. HF's Inference Endpoints is the easiest and fastest way to spin up model copies on the required GPU hardware. It is also cheaper than other cloud options; I've seen it cost something like half of other cloud offerings, with better and easier features for spinning up and down.

If you go here you can explore it: https://ui.endpoints.huggingface.co/

1 Like

@sgugger I have a 3060 laptop GPU. How can I run 7b-chat? Do you think I need to change anything to run it?

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025845206/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Traceback (most recent call last):
  File "/media/aryan/sandisk_ex/llama2/llama/example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/model.py", line 375, in __init__
    self.attention = Attention(args)
                     ^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/model.py", line 228, in __init__
    self.wo = RowParallelLinear(
              ^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fairscale/nn/model_parallel/layers.py", line 349, in __init__
    self.weight = Parameter(torch.Tensor(self.out_features, self.input_size_per_partition))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 5.77 GiB of which 39.12 MiB is free. Process 35536 has 17.52 MiB memory in use. Including non-PyTorch memory, this process has 5.12 GiB memory in use. Of the allocated memory 5.00 GiB is allocated by PyTorch, and 1.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-03-09 00:21:33,658] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 62595) of binary: /home/aryan/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/home/aryan/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-09_00:21:33
  host      : ar
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 62595)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
1 Like

What is the best way to estimate which models can be run on a given GPU, for someone learning to run LLMs?

1 Like

You can use this Space: Model Memory Utility - a Hugging Face Space by hf-accelerate.

2 Likes

You can refer to this excellent blog: https://blog.eleuther.ai/transformer-math/

1 Like