Hi,
I am using the nvidia/NV-Embed-v2 model (about 7.85 billion parameters) to generate embeddings, and I'm loading it with FP16 precision using the following code:
import torch
from transformers import AutoTokenizer, AutoModel

torch.cuda.empty_cache()
device = torch.device('cuda')

# Load the model, move it to the GPU, then cast the weights to FP16.
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True).to(device)
model.to(torch.float16)
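
The measurement I refer to below is something along these lines (a minimal sketch using torch.cuda's built-in memory counters):

# Report how much GPU memory PyTorch has allocated and reserved on the current device.
allocated_gb = torch.cuda.memory_allocated() / 1e9
reserved_gb = torch.cuda.memory_reserved() / 1e9
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"allocated: {allocated_gb:.2f} GB, reserved: {reserved_gb:.2f} GB, peak: {peak_gb:.2f} GB")

(nvidia-smi will usually report somewhat more than memory_allocated, since it also counts the CUDA context and PyTorch's caching allocator.)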
I expect the model to take approximately 15.7 GB of GPU memory (7.85 B parameters × 2 bytes per parameter for FP16 precision). However, when I measure the memory usage after running this code, without doing any inference, the reported memory consumption is about 25 GB.
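
The 15.7 GB figure is just a weights-only, back-of-the-envelope estimate (a sketch using the ~7.85 B parameter count mentioned above):

# Expected footprint of the FP16 weights alone.
num_params = 7.85e9            # ~7.85 billion parameters
bytes_per_param = 2            # FP16 stores each parameter in 2 bytes
expected_bytes = num_params * bytes_per_param
print(expected_bytes / 1e9)    # ~15.7 GB (decimal gigabytes)
print(expected_bytes / 2**30)  # ~14.6 GiB (the binary units that nvidia-smi's MiB correspond to)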
Can someone please explain why the memory usage is higher than expected? Is this extra memory usage due to model overhead (e.g., optimizer states, activation storage) or something else?
Thanks in advance!