Determining if a model will run locally

Basically, it comes down to the file size of the model weights. That size changes depending on whether you quantize and to how many bits, and, if you don't quantize, whether you load the weights in 32-bit or 16-bit precision.
On top of the model itself, VRAM is also consumed in proportion to the length of the context being generated.
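
As a minimal sketch of that arithmetic (the function name and the byte-per-parameter figures are my own illustration, not tied to any particular loader), the weight footprint is just the parameter count times the bytes each parameter takes at a given precision:

```python
# Rough weight-size estimate: parameter count x bytes per parameter.
# Ignores quantization block overhead and the extra VRAM used for the
# context during generation.

BYTES_PER_PARAM = {
    "fp32": 4.0,  # unquantized, 32-bit
    "fp16": 2.0,  # unquantized, 16-bit
    "int8": 1.0,  # 8-bit quantization
    "int4": 0.5,  # 4-bit quantization
}

def weight_size_gb(params_billion: float, precision: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"8B model, {precision}: ~{weight_size_gb(8, precision):.1f}GB")
# fp32 ~32GB, fp16 ~16GB, int8 ~8GB, int4 ~4GB
```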

Roughly speaking, assuming 4-bit quantization, a model will run if the VRAM capacity in GB is about the same as the model's parameter count in billions (B). (To run a 4-bit quantized 8B model, you'll need about 8GB: the weights themselves come to roughly 4.5GB, and the rest is used during inference. A 12B model would need about 12GB. This is just a rough guide.)
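
Here is that rule of thumb as a rough sanity check (a sketch under the assumptions above: 4-bit weights at about 0.5GB per billion parameters, with roughly the same amount again kept free for inference; the function name is mine):

```python
# Rule-of-thumb fit check: at 4-bit quantization, a model needs roughly
# as many GB of VRAM as it has billions of parameters (about half for
# the weights, the rest left free for inference).

def fits_in_vram(params_billion: float, vram_gb: float,
                 bytes_per_param: float = 0.5) -> bool:
    weights_gb = params_billion * bytes_per_param  # e.g. 8B @ 4-bit -> ~4GB
    needed_gb = weights_gb * 2                     # rough margin for inference
    return needed_gb <= vram_gb

print(fits_in_vram(8, 8))                          # 8B @ 4-bit on 8GB  -> True
print(fits_in_vram(12, 8))                         # 12B @ 4-bit on 8GB -> False
print(fits_in_vram(0.1, 24, bytes_per_param=4.0))  # RoBERTa in fp32 on 24GB -> True
```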

And since RoBERTa is only about 0.1B parameters (roughly 0.4GB of weights even in 32-bit), you should be fine running it without quantization on 24GB of VRAM.