I understand you want to run inference with a 32B-parameter model. If it doesn't fit entirely in GPU memory, you can offload part of the weights to CPU RAM or disk.
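For example, Hugging Face Transformers (with Accelerate installed) can split the model across GPU, CPU, and disk automatically. A minimal sketch, assuming a causal LM checkpoint on the Hub; the model ID below is a placeholder:

```python
# Sketch: load a 32B model in FP16 and let Accelerate offload whatever
# doesn't fit on the GPU to CPU RAM and, if needed, to disk.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-32b-model"  # placeholder, substitute your checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # halves weight memory vs. FP32
    device_map="auto",           # place layers on GPU/CPU as capacity allows
    offload_folder="offload",    # spill any remaining layers to disk
)

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Offloading trades speed for capacity, so first check how much memory the weights themselves need: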
FP32:
32 billion parameters * 4 bytes/parameter = 128 billion bytes
128 billion bytes / (1024 * 1024 * 1024) ≈ 119 GiB (about 120 GB)
FP16:
32 billion parameters * 2 bytes/parameter = 64 billion bytes
64 billion bytes / (1024 * 1024 * 1024) ≈ 60 GiB
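The same arithmetic as a few lines of Python, in case you want to plug in other parameter counts or precisions:

```python
# Weight-only memory footprint of a 32B-parameter model at different precisions.
PARAMS = 32_000_000_000  # 32 billion parameters

for dtype, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
    total_bytes = PARAMS * bytes_per_param
    print(f"{dtype}: {total_bytes / 1024**3:.1f} GiB")

# Prints roughly: FP32: 119.2 GiB, FP16: 59.6 GiB
```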
Note:
This calculation only considers the memory required to store the model parameters themselves.
In reality, you’ll also need memory for:
- Activations during inference (including the KV cache, which grows with batch size and context length)
- Optimizer states (if training)
- Intermediate calculations
- System overhead
This means the actual memory requirement will be significantly higher than the weights-only figures above.
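To give a feel for one of the larger inference-time extras, here is a rough KV-cache estimate. Every architecture number in it (64 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption standing in for a typical 32B-class model with grouped-query attention; substitute your model's actual config values.

```python
# Rough KV-cache estimate with assumed architecture values (not taken from
# any specific model config): 64 layers, 8 KV heads, head_dim 128, FP16 cache.
num_layers = 64
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2       # FP16
seq_len = 8192           # context length
batch_size = 1

# Keys + values (factor of 2), per layer, per token, per sequence.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
total = bytes_per_token * seq_len * batch_size
print(f"KV cache: {total / 1024**3:.1f} GiB")  # ~2.0 GiB under these assumptions
```

A model without grouped-query attention would need several times more for the same context, and longer contexts or larger batches scale the cache linearly, so budget beyond the weight figures alone.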