Multi-GPU Build Possible?

I understand you want to run inference on a 32B model?
If it doesn't fit in VRAM, you can offload to CPU/disk (see the sketch at the end of this post). Here are the numbers.

FP32:
32 billion parameters * 4 bytes/parameter = 128 billion bytes
128 billion bytes / (1024 * 1024 * 1024) ≈ 119.2 GiB, i.e. about 120 GiB

FP16:
32 billion parameters * 2 bytes/parameter = 64 billion bytes
64 billion bytes / (1024 * 1024 * 1024) ≈ 59.6 GiB, i.e. about 60 GiB
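
The same arithmetic as a quick Python sketch (plain arithmetic, nothing model-specific):

```python
# Weight-only memory estimate for a 32B-parameter model.
params = 32e9  # 32 billion parameters

for name, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
    total_bytes = params * bytes_per_param
    print(f"{name}: {total_bytes / 1e9:.0f} GB raw, "
          f"~{total_bytes / 1024**3:.1f} GiB")
```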

Note:
This calculation only considers the memory required to store the model parameters themselves.
In reality, you’ll also need memory for:

  • Activations and the KV cache during inference
  • Optimizer states (if training)
  • Intermediate calculations
  • System overhead

This means the actual memory requirements will be significantly higher.
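
To put one of those overheads in concrete numbers, here is a back-of-the-envelope KV-cache estimate. The layer count, KV-head count, and head dimension below are assumptions for a typical 32B-class model, not any specific checkpoint; substitute the values from your model's config.json:

```python
# Assumed (hypothetical) config for a 32B-class model with GQA:
n_layers = 64        # assumption; check your model's config.json
n_kv_heads = 8       # assumption (grouped-query attention)
head_dim = 128       # assumption
bytes_per_elem = 2   # FP16 cache

seq_len = 32_768
batch_size = 1

# Each layer caches K and V: n_kv_heads * head_dim values per token each.
kv_bytes = (2 * n_layers * n_kv_heads * head_dim
            * bytes_per_elem * seq_len * batch_size)
print(f"KV cache: ~{kv_bytes / 1024**3:.1f} GiB at {seq_len} tokens")
# ~8 GiB under these assumptions
```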
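And since you asked about offloading: a minimal sketch using transformers with accelerate installed. `device_map="auto"` fills the GPU(s) first, then spills remaining layers to CPU RAM, and `offload_folder` lets it spill further to disk. The model ID is a placeholder; use your actual checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-32b-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # 2 bytes/param, per the math above
    device_map="auto",           # spread across GPUs, then CPU
    offload_folder="offload",    # spill leftover weights to disk
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Expect offloaded inference to be much slower than fully on-GPU, so FP16 (or a quantized variant) with enough VRAM is preferable when you can manage it.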