I’m trying to run 32B models on an AWS EC2 g6e.xlarge instance.
It has 1 GPU with 48 GB of VRAM and 4 vCPUs with 32 GB of RAM.
For example, the log for “Qwen/Qwen2.5-32B-Instruct”:
Not enough VRAM to run the model: Available: 45.89GB - Model 60.83GB.
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 540.00 MiB. GPU 0 has a total capacity of 44.52 GiB of which 328.25 MiB is free.
Running such a model with TGI’s quantization flag really helps: QUANTIZE=bitsandbytes.
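For reference, the quantized run that works for me looks roughly like this (a sketch; ports, volume paths, and the image tag are just my setup, not anything special):

```
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-32B-Instruct \
  --quantize bitsandbytes
```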
The question is: is it possible to run this model at full precision, but using both GPU and CPU memory, 48 + 32 = 80 GB?
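To clarify what I mean, here is a minimal sketch of the kind of offloading I have in mind, using transformers/Accelerate `device_map="auto"` with a `max_memory` budget (the memory values are illustrative guesses for this instance, leaving some headroom on each device; I don’t know if this is viable for serving):

```python
# Sketch: full-precision (bf16) loading split across GPU and CPU memory.
# max_memory values are illustrative for a g6e.xlarge (48 GB VRAM, 32 GB RAM).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-32B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # full, non-quantized weights
    device_map="auto",                        # let Accelerate place layers on GPU, then CPU
    max_memory={0: "44GiB", "cpu": "28GiB"},  # per-device budgets, with headroom
    offload_folder="offload",                 # spill to disk if GPU+CPU still isn't enough
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

My understanding is that layers offloaded to CPU get shuttled to the GPU per forward pass, so this would be much slower than quantized GPU-only inference, but I’d like to know if it (or a TGI equivalent) is a workable option at all.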