How to estimate the VRAM needed for a prompt according to the prompt's size (inference and fine-tuning)

EquinoxElahin · September 21, 2023, 9:31am

Hi, I noticed that giving an LLM a huge prompt (4000 tokens) can consume around 6 GB of VRAM (with an 8-bit model).
So it’s really difficult to fine-tune on huge prompts when you use the free Colab T4.
My question is: can someone help me by explaining the operations behind the VRAM consumption (as a function of the prompt length) during inference and during fine-tuning with LoRA?
I distinguish the two because fine-tuning on a causal task consumes the inference baseline plus what fine-tuning itself needs (gradients: 2 bytes per parameter, plus the same for the optimizer, from what I read).
And this point is confusing me.
I mean: all right, when doing inference on a huge prompt, the LLM needs to keep in its embeddings the 4000 previous tokens it has seen. But I don’t understand why this is required AT THE BEGINNING of fine-tuning. It should start with the first token of my prompt, predict the next one, compute the loss by comparing the prediction with the true next token in my corpus, adjust the weights, and so on until the 4000th token. (At that point I would understand that it consumes more VRAM than the T4 has, but the consumption seems to get high extremely quickly, so I guess it happens at the beginning of the process.)
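To make the question concrete, here is a rough back-of-the-envelope sketch of the quantities I have in mind. Everything in it is an assumption for illustration: the model shape (32 layers, hidden size 4096, vocabulary of 32000), the number of trainable LoRA parameters, and the helper names are hypothetical and not measured on my setup. The 2 bytes for gradients and 2 bytes for the optimizer are the figures I read; Adam-style optimizers usually keep two states per parameter, so the real number may be higher.

```python
GIB = 2**30  # bytes in one GiB


def kv_cache_bytes(seq_len, n_layers, hidden_size, bytes_per_value=2, batch_size=1):
    """Keys and values kept for every past token during inference.

    Two tensors (K and V) per layer, each of shape [batch, seq_len, hidden_size].
    """
    return 2 * n_layers * batch_size * seq_len * hidden_size * bytes_per_value


def logits_bytes(seq_len, vocab_size, bytes_per_value=4, batch_size=1):
    """Output logits when all positions of the prompt go through one forward pass."""
    return batch_size * seq_len * vocab_size * bytes_per_value


def lora_training_state_bytes(n_trainable_params, grad_bytes=2, optimizer_bytes=2):
    """Gradients plus optimizer state, counted only for the trainable LoRA weights."""
    return n_trainable_params * (grad_bytes + optimizer_bytes)


if __name__ == "__main__":
    # Hypothetical 7B-like shape; these values are assumptions, not measurements.
    seq_len, n_layers, hidden_size, vocab_size = 4000, 32, 4096, 32000
    n_lora_params = 4_000_000

    print(f"KV cache:   {kv_cache_bytes(seq_len, n_layers, hidden_size) / GIB:.2f} GiB")
    print(f"Logits:     {logits_bytes(seq_len, vocab_size) / GIB:.2f} GiB")
    print(f"LoRA state: {lora_training_state_bytes(n_lora_params) / GIB:.3f} GiB")
```

The first two terms grow linearly with the prompt length, while the LoRA term does not depend on it at all. This sketch leaves out the intermediate activations that have to be stored for the backward pass, which is probably the part I am missing.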

Can someone enlighten me?
Thanks,

julien-c · September 22, 2023, 10:00am

maybe interesting for @muellerzr !
