Memory Usage for Inference Depending on Size of Input Data

For my work, I recently started looking at how to run LLM inference on the GPU. I got CUDA set up without any problems.

I then started playing around with the HuggingFace implementation of Meta's No Language Left Behind (NLLB) translation model, because I plan on translating large text corpora. Originally, I thought I was just limited by the VRAM needed to load the model onto the GPU; however, that doesn't seem to be the case.

After loading the 600M-parameter version, I call torch.cuda.mem_get_info() and see that I am left with 21 GB of the original 24 GB of VRAM on one of my RTX 4090s.
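For reference, this is roughly how I check the remaining VRAM (a minimal sketch; the device string is just whichever GPU the model was loaded onto):

import torch

# mem_get_info returns (free_bytes, total_bytes) for the given CUDA device.
free, total = torch.cuda.mem_get_info("cuda:1")
print(f"free: {free / 1024**3:.1f} GB of {total / 1024**3:.1f} GB")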

I then experimented a bit to test the limits of the graphics card: I loaded a document with about 50,000 characters and tried to translate it with NLLB, but I get a CUDA out-of-memory error. This surprises me, because 50,000 characters of input data should never come close to 20 GB of memory.

Can anyone guide me on how the memory usage for inference is computed? Apparently it is way more than just the size of the input data…
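My rough guess is that the self-attention score matrices grow quadratically with the number of input tokens, but I am not sure this is the right way to think about it. A back-of-envelope sketch of what I mean (the head count and the fp32 assumption are guesses on my part, not read from the model config):

# Very rough guess at the size of the attention score matrices alone,
# ignoring everything else (other activations, KV cache, cross-attention).
n_tokens = 20_000        # ~50,000 characters could easily tokenize to this many tokens
n_heads = 16             # assumed for nllb-200-distilled-600M
bytes_per_value = 4      # assuming fp32 attention scores
scores_one_layer = n_heads * n_tokens ** 2 * bytes_per_value
print(f"attention scores, single layer: {scores_one_layer / 1024**3:.1f} GB")  # ~24 GB

If that guess is anywhere near right, a single layer's attention scores would already exceed my free VRAM, but I would like someone to confirm whether this is actually how the memory usage scales.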

Please ignore that NLLB is not made to translate this many tokens at once. Again, I am more interested in the computational limits I am running into.

I already use torch.no_grad() and put the model in evaluation mode, which I read online should save some memory. My full code to run the inference looks like this:

device = "cuda:1" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").to(device)
model.eval()

def translationPipeline(text):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(input_ids,forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"])
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded

# c = df_text.iloc[1,:].body.apply(translationPipeline)
with torch.no_grad():
    c = translationPipeline(df_text.iloc[0,:].body)
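One thing I plan to try next (not in the code above) is measuring the actual peak allocation during generation, to see how it scales with the input length:

# Track the peak allocation during a single translation call.
torch.cuda.reset_peak_memory_stats(device)
c = translationPipeline(df_text.iloc[0, :].body)
peak = torch.cuda.max_memory_allocated(device)
print(f"peak memory allocated during generate: {peak / 1024**3:.1f} GB")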

Yes, I also have this problem. I'm experimenting with 13-billion-parameter LLaMA models in fp16, and if the context window is filled to the maximum, memory usage can be almost twice as much as the model itself occupies. Yet everywhere people write that the overhead is at most 20 percent of the model size.
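For what it's worth, here is my own rough estimate of the KV cache alone; the model dimensions, context length, and batch size below are assumptions on my part (LLaMA-13B-like), not values I pulled from any config:

# Rough KV-cache estimate; all dimensions below are assumed, not measured.
n_layers = 40            # assumed for a 13B model
d_model = 5120           # assumed hidden size
seq_len = 4096           # context window filled to the maximum
batch_size = 8           # illustrative batch size
bytes_per_value = 2      # fp16
# Factor of 2 for the key and value tensors cached in every layer.
kv_cache = 2 * n_layers * seq_len * d_model * bytes_per_value * batch_size
print(f"KV cache: {kv_cache / 1024**3:.1f} GB")  # ~25 GB at this batch size

That is already in the same ballpark as the ~24 GB the fp16 weights themselves take, which is why the "at most 20 percent overhead" rule of thumb stops holding once the cache and batch size grow.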