I’m trying to run 32B models on an AWS EC2 g6e.xlarge instance.
It has 1 GPU with 48 GB of VRAM and 4 vCPUs with 32 GB of RAM.
For example, the log for “Qwen/Qwen2.5-32B-Instruct”:
Not enough VRAM to run the model: Available: 45.89GB - Model 60.83GB.
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 540.00 MiB. GPU 0 has a total capacity of 44.52 GiB of which 328.25 MiB is free.
Running such a model with TGI’s quantization flag really helps: QUANTIZE=bitsandbytes.
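For reference, the quantized run that works for me looks roughly like this (a sketch; ports, volume paths, and the image tag are just my setup, not anything special):

```
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-32B-Instruct \
  --quantize bitsandbytes
```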
The question is: is it possible to run this model at full precision, but using both GPU and CPU memory, 48 + 32 = 80 GB?
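To clarify what I mean, here is a minimal sketch of the kind of offloading I have in mind, using transformers/Accelerate `device_map="auto"` with a `max_memory` budget (the memory values are illustrative guesses for this instance, leaving some headroom on each device; I don’t know if this is viable for serving):

```python
# Sketch: full-precision (bf16) loading split across GPU and CPU memory.
# max_memory values are illustrative for a g6e.xlarge (48 GB VRAM, 32 GB RAM).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-32B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # full, non-quantized weights
    device_map="auto",                        # let Accelerate place layers on GPU, then CPU
    max_memory={0: "44GiB", "cpu": "28GiB"},  # per-device budgets, with headroom
    offload_folder="offload",                 # spill to disk if GPU+CPU still isn't enough
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

My understanding is that layers offloaded to CPU get shuttled to the GPU per forward pass, so this would be much slower than quantized GPU-only inference, but I’d like to know if it (or a TGI equivalent) is a workable option at all.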