For my work, I recently started looking into running LLM inference on the GPU. I got CUDA set up without any problems.
I then tried to play around with the HuggingFace implementation of Meta's No Language Left Behind (NLLB) translation model, because I plan on translating large text corpora. Originally, I thought I was just limited by the VRAM needed to load the model onto the GPU; however, that doesn't seem to be the case.
After loading the 600M-parameter version, I call torch.cuda.mem_get_info() and see that I am left with 21 GB of the original 24 GB of VRAM on one of my RTX 4090s.
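For reference, mem_get_info() returns raw byte counts as a (free, total) tuple; I just convert those to GiB. The concrete byte values below are examples standing in for the tuple my card returns:

```python
def gib(n_bytes: int) -> float:
    """Convert a byte count (as returned by torch.cuda.mem_get_info) to GiB."""
    return n_bytes / 1024**3

# Example values standing in for: free_b, total_b = torch.cuda.mem_get_info()
free_b, total_b = 22_548_578_304, 25_769_803_776  # ~21 GiB free of 24 GiB
print(f"{gib(free_b):.1f} GiB free of {gib(total_b):.1f} GiB")
```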
Then I ran a quick experiment to probe the limits of the graphics card: I loaded a document of about 50,000 characters and tried to translate it with NLLB, but I got a CUDA out-of-memory error. This surprises me, since 50,000 characters of input should be nowhere near 20 GB of memory.
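For comparison, here is my naive estimate of the raw input size, assuming (pessimistically) one token per character with token ids stored as int64, which is clearly tiny:

```python
# Pessimistic upper bound on the raw input tensor:
# one token per character, token ids stored as int64 (8 bytes each).
n_chars = 50_000
bytes_per_token_id = 8  # int64
input_bytes = n_chars * bytes_per_token_id
print(f"{input_bytes / 1024**2:.2f} MiB")  # well under a single GiB
```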
Can anyone guide me on how the memory usage for inference is computed? Apparently it is way more than just the size of the input data…
Please ignore that NLLB is not made to translate such a large number of tokens at once. Again, I am more interested in the computational limits here.
I already use torch.no_grad() and put the model in evaluation mode, which I read online should save some memory. My full code for running the inference looks like this:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda:1" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").to(device)
model.eval()

def translationPipeline(text):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"],
        )
    # decode expects a 1-D sequence of ids, so take the first (only) batch element
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded

# c = df_text.iloc[1,:].body.apply(translationPipeline)
with torch.no_grad():
    c = translationPipeline(df_text.iloc[0,:].body)
```