Memory Usage for Inference Depending on Size of Input Data

For my work, I recently started looking at how to run LLM inference on the GPU. I got CUDA set up without any problems.

I then started playing around with the HuggingFace implementation of Meta's No Language Left Behind (NLLB) translation model, because I plan on translating large text corpora. Originally, I thought I was just limited by the VRAM needed to load the model onto the GPU; however, that doesn't seem to be the case.

After loading the 600M-parameter version, I call torch.cuda.mem_get_info() and see that I am left with 21 GB of the original 24 GB of VRAM on one of my RTX 4090s.
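For reference, this is roughly how I check the remaining VRAM (a minimal sketch; the device string is just whichever GPU the model was loaded onto):

import torch

# mem_get_info returns (free_bytes, total_bytes) for the given CUDA device.
free, total = torch.cuda.mem_get_info("cuda:1")
print(f"free: {free / 1024**3:.1f} GB of {total / 1024**3:.1f} GB")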

I then experimented a bit to test the limits of the graphics card: I loaded a document with about 50,000 characters and tried to translate it with NLLB, but I get a CUDA out-of-memory error. This surprises me, because 50,000 characters of input data should never come close to 20 GB of memory.

Can anyone guide me on how the memory usage for inference is computed? Apparently it is way more than just the size of the input data…
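My rough guess is that the self-attention score matrices grow quadratically with the number of input tokens, but I am not sure this is the right way to think about it. A back-of-envelope sketch of what I mean (the head count and the fp32 assumption are guesses on my part, not read from the model config):

# Very rough guess at the size of the attention score matrices alone,
# ignoring everything else (other activations, KV cache, cross-attention).
n_tokens = 20_000        # ~50,000 characters could easily tokenize to this many tokens
n_heads = 16             # assumed for nllb-200-distilled-600M
bytes_per_value = 4      # assuming fp32 attention scores
scores_one_layer = n_heads * n_tokens ** 2 * bytes_per_value
print(f"attention scores, single layer: {scores_one_layer / 1024**3:.1f} GB")  # ~24 GB

If that guess is anywhere near right, a single layer's attention scores would already exceed my free VRAM, but I would like someone to confirm whether this is actually how the memory usage scales.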

Please ignore that NLLB is not made to translate this many tokens at once. Again, I am more interested in the computational limits I am running into.

I already use torch.no_grad() and put the model in evaluation mode, which I read online should save some memory. My full code to run the inference looks like this:

device = "cuda:1" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").to(device)
model.eval()

def translationPipeline(text):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(input_ids,forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"])
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded

# c = df_text.iloc[1,:].body.apply(translationPipeline)
with torch.no_grad():
    c = translationPipeline(df_text.iloc[0,:].body)
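One thing I plan to try next (not in the code above) is measuring the actual peak allocation during generation, to see how it scales with the input length:

# Track the peak allocation during a single translation call.
torch.cuda.reset_peak_memory_stats(device)
c = translationPipeline(df_text.iloc[0, :].body)
peak = torch.cuda.max_memory_allocated(device)
print(f"peak memory allocated during generate: {peak / 1024**3:.1f} GB")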

Yes, I also have this problem. I'm experimenting with 13-billion-parameter LLaMA models in fp16, and if the context window is filled to the maximum, memory usage can be almost twice as much as the model itself occupies. Yet everywhere people write that the overhead is at most 20 percent of the model size.
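For what it's worth, here is my own rough estimate of the KV cache alone; the model dimensions, context length, and batch size below are assumptions on my part (LLaMA-13B-like), not values I pulled from any config:

# Rough KV-cache estimate; all dimensions below are assumed, not measured.
n_layers = 40            # assumed for a 13B model
d_model = 5120           # assumed hidden size
seq_len = 4096           # context window filled to the maximum
batch_size = 8           # illustrative batch size
bytes_per_value = 2      # fp16
# Factor of 2 for the key and value tensors cached in every layer.
kv_cache = 2 * n_layers * seq_len * d_model * bytes_per_value * batch_size
print(f"KV cache: {kv_cache / 1024**3:.1f} GB")  # ~25 GB at this batch size

That is already in the same ballpark as the ~24 GB the fp16 weights themselves take, which is why the "at most 20 percent overhead" rule of thumb stops holding once the cache and batch size grow.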