Hi,
I am using the nvidia/NV-Embed-v2 model (about 7.85 billion parameters) to generate embeddings, and I'm loading it with FP16 precision using the following code:
import torch
from transformers import AutoTokenizer, AutoModel

torch.cuda.empty_cache()
device = torch.device('cuda')

# Load the model, move it to the GPU, then cast the weights to FP16.
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True).to(device)
model.to(torch.float16)
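
The measurement I refer to below is something along these lines (a minimal sketch using torch.cuda's built-in memory counters):

# Report how much GPU memory PyTorch has allocated and reserved on the current device.
allocated_gb = torch.cuda.memory_allocated() / 1e9
reserved_gb = torch.cuda.memory_reserved() / 1e9
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"allocated: {allocated_gb:.2f} GB, reserved: {reserved_gb:.2f} GB, peak: {peak_gb:.2f} GB")

(nvidia-smi will usually report somewhat more than memory_allocated, since it also counts the CUDA context and PyTorch's caching allocator.)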
I expect the model to take approximately 15.7 GB of GPU memory (7.85 B parameters × 2 bytes per parameter for FP16 precision). However, when I measure the memory usage after running this code, without doing any inference, the reported memory consumption is about 25 GB.
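
The 15.7 GB figure is just a weights-only, back-of-the-envelope estimate (a sketch using the ~7.85 B parameter count mentioned above):

# Expected footprint of the FP16 weights alone.
num_params = 7.85e9            # ~7.85 billion parameters
bytes_per_param = 2            # FP16 stores each parameter in 2 bytes
expected_bytes = num_params * bytes_per_param
print(expected_bytes / 1e9)    # ~15.7 GB (decimal gigabytes)
print(expected_bytes / 2**30)  # ~14.6 GiB (the binary units that nvidia-smi's MiB correspond to)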
Can someone please explain why the memory usage is higher than expected? Is this extra memory usage due to model overhead (e.g., optimizer states, activation storage) or something else?
Thanks in advance!