Basically, it is roughly proportional to the file size of the model weights. The requirement changes depending on whether you quantize and to how many bits, and, if you don't quantize, whether you load in 32-bit or 16-bit.
In addition to the model size, VRAM is also consumed according to the length of the context being generated.
Roughly speaking, assuming 4-bit quantization, it will work if your VRAM in GB is about the same as the model's parameter count in billions (B). (If you want to run a 4-bit quantized 8B model, you'll need about 8GB: the weight file itself is about 4.5GB, and the rest is used for inference… if it's a 12B model, you'll need about 12GB. This is just a rough guide.)
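To make that rule of thumb concrete, here is a back-of-envelope sketch (not from the original reply; the helper name is made up, and real usage also needs headroom for the KV cache, activations, and framework overhead):

```python
# Hypothetical helper: weight VRAM is roughly parameter count times bytes per parameter.
def rough_weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    bytes_per_param = bits_per_param / 8       # e.g. 4-bit -> 0.5 bytes per parameter
    return params_billion * bytes_per_param    # 1e9 params * bytes ~= GB

print(rough_weight_vram_gb(8, 4))    # ~4 GB of 4-bit weights for an 8B model;
                                     # the rule of thumb budgets ~8 GB total for headroom
print(rough_weight_vram_gb(8, 16))   # ~16 GB if the same model is kept in fp16/bf16
print(rough_weight_vram_gb(0.1, 32)) # ~0.4 GB for a RoBERTa-sized model in fp32
```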
And since RoBERTa is about 0.1B parameters, you should be fine with 24GB of VRAM even without quantization.
Thank you. How can I tell what the quantization of a given model is? The one I linked to says it is F32-I64. So it will not have 4-bit quantization, correct?
Yes. In the case of RoBERTa, it was probably uploaded as 32-bit float.
Well, you do the type conversion yourself as needed when loading (torch_dtype=torch.***). If you don't, it will be loaded in the format it was uploaded in, or, if that isn't possible, it will be converted to a format that doesn't cause problems and then loaded.
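For example, a minimal sketch with Transformers (the checkpoint is just an illustrative fp32 one; pick whatever dtype fits your GPU):

```python
import torch
from transformers import AutoModel

# Override the load dtype: the fp32 weights are converted to bf16 while loading.
model = AutoModel.from_pretrained(
    "FacebookAI/roberta-base",    # example checkpoint stored as 32-bit float
    torch_dtype=torch.bfloat16,
).to("cuda")                      # assumes a CUDA GPU is available

# Omit torch_dtype and the weights load in the dtype they were uploaded in (fp32 here),
# which uses roughly twice the VRAM of bf16.
```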
Some models are already quantized, but if they are not, you can quantize them yourself when loading (on-the-fly quantization).
It depends on the quantization algorithm, but in general, the smaller the size, the lower the precision. On the other hand, for the computation itself, formats the GPU handles natively run faster: for example, bfloat16 is fast for inference on GeForce RTX 30x0 and later, and its precision is also relatively good (if you ignore the attention-related bug).
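A small sketch for picking a compute dtype based on what the GPU reports (Ampere / RTX 30x0 and later report bfloat16 support; this is just one reasonable way to choose):

```python
import torch

# Prefer bf16 where the hardware supports it, otherwise fall back to fp16.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16   # fast on Ampere+ and keeps fp32's exponent range
else:
    dtype = torch.float16    # fallback for older GPUs
print(dtype)
```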
Also, each quantization algorithm differs in quantization time, inference speed, and precision. For example, EXL2 takes a long time to quantize, but inference with it is fast.
GGUF is used a lot because it is compatible with many programs and the converted files are standalone, and bitsandbytes' NF4 is used a lot because the Hugging Face libraries supported it relatively early and it is easy to use for on-the-fly quantization.
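A hedged sketch of the on-the-fly NF4 path with bitsandbytes via Transformers (the repo id is a placeholder for any unquantized checkpoint; it requires the bitsandbytes package and a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the weights while loading
    bnb_4bit_quant_type="nf4",              # the NF4 format mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 where supported
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-8b-model",               # placeholder: an unquantized checkpoint
    quantization_config=bnb_config,
    device_map="auto",                      # let accelerate place weights on the GPU
)
```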
Of course, you can configure zram and tune all your resources to make it run smoothly even with a training package. A Llama model is very light.