Multi GPU Build Possible?

I’m a little worried that I’m about to spend a considerable amount of money without ending up with something that can actually handle a 32B LLM. I get mixed messages: either people reassure me it’ll work with a bunch of janky PCIe mining risers, or they call me an idiot… I’d just like to know how to get started. Help? I’m a noob, I know, but we all have to start somewhere. Be gentle…

• Intel® Core™ i9-14900KF processor
• 2× Kingston FURY Beast 64GB DDR5 RAM 5600MT/s CL36 (black)
• 4× MSI GeForce RTX 4060 Ti Ventus 2X Black OC 16GB graphics card
• MSI PRO Z790-A MAX WIFI ATX LGA1700 motherboard
• Corsair HX1500i ATX 3.0 1500W PSU
• Noctua NH-L9x65 chromax.black CPU cooler
• Kingston Fury Renegade with Heatsink 2TB SSD


MSI GeForce RTX 4060 Ti Ventus 2X Black OC 16GB graphics card

It’s much more powerful than my GPU, but even at 4-bit quantization a 32B model is about 20GB of weights, and it needs some extra VRAM at runtime on top of that, so it will exceed a single card’s 16GB and spill over into system RAM. It should still run since there is enough RAM, but it’s not clear whether it will be comfortably usable.
A model around 16B should fit comfortably in about 10GB at 4-bit quantization. At 16-bit precision without quantization, even an 8B model would run out of VRAM… :sweat_smile:
In that case, you could sacrifice precision and run a 32B model at 3-bit or 2-bit quantization, use a more powerful GPU, settle for a smaller LLM, or put up with somewhat slower speeds.

Oh, if you have a multi-GPU setup (2× 16GB), you should be fine as long as you use 4-bit quantization. According to reports on forums and elsewhere, the load may not always be evenly distributed across the cards, but there are usually no particular problems with running the model itself.
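
For reference, here’s a minimal sketch of what loading a ~32B model in 4-bit across multiple GPUs can look like with transformers + bitsandbytes. The model ID is just a placeholder; swap in whatever 32B model you actually use, and the exact split across cards will depend on your hardware:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-32B-Instruct"  # placeholder: any ~32B causal LM

# 4-bit (NF4) quantization keeps the weights around ~20GB in total
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate shard the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("My rig has four 16GB GPUs and", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```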

I understand you want to run inference on a 32B model?
You can offload to CPU RAM. Here are the numbers (there’s a sketch of offloading after the note below).

FP32:
32 billion parameters * 4 bytes/parameter = 128 billion bytes
128 billion bytes / (1024 * 1024 * 1024) ≈ 119 GiB

FP16:
32 billion parameters * 2 bytes/parameter = 64 billion bytes
64 billion bytes / (1024 * 1024 * 1024) ≈ 60 GiB
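
If it helps, here’s a quick back-of-the-envelope helper for the same arithmetic; the INT8 and 4-bit rows are just added for comparison:

```python
def weight_memory_gib(n_params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GiB."""
    total_bytes = n_params_billion * 1e9 * bytes_per_param
    return total_bytes / (1024 ** 3)

for label, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    print(f"32B @ {label}: {weight_memory_gib(32, bytes_per_param):.1f} GiB")
# 32B @ FP32: 119.2 GiB
# 32B @ FP16: 59.6 GiB
# 32B @ INT8: 29.8 GiB
# 32B @ 4-bit: 14.9 GiB
```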

Note:
This calculation only considers the memory required to store the model parameters themselves.
In reality, you’ll also need memory for:

  • Activations during inference
  • Optimizer states (if training)
  • Intermediate calculations
  • System overhead
This means the actual memory requirements will be significantly higher.
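
Here is a minimal sketch of what CPU offloading can look like with transformers + accelerate; the model ID and per-device memory caps are placeholders you’d adjust to your own hardware:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-32B-Instruct"  # placeholder: any ~32B causal LM

# Cap what each device may hold; layers that don't fit on the GPUs are offloaded to CPU RAM.
max_memory = {0: "14GiB", 1: "14GiB", 2: "14GiB", 3: "14GiB", "cpu": "100GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",      # accelerate places layers GPU-first, then spills to CPU
    max_memory=max_memory,
)
```

Anything that ends up on the CPU will make generation noticeably slower, which is why quantizing down to 4-bit (so the whole model fits in VRAM) is usually the more comfortable option.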