Optimizing Model Loading with a CPU Bottleneck

I am trying to load the pretrained mt5-xl model on a GCP VM with 4 vCPUs, 15 GB of memory, and an NVIDIA Tesla L4 GPU with 24 GB of GPU memory.

Code-wise it's simply:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/mt5-xl")
```

Model loading fails with `RuntimeError: unable to mmap 14970735570 bytes from file` on both CPU-only and GPU VMs, which indicates a model-loading error rather than a GPU OOM.
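As a quick sanity check (plain arithmetic, no ML libraries; the byte count is copied from the error message), the checkpoint is almost exactly the size of the VM's RAM:

```python
# Compare the size from_pretrained tried to mmap with the VM's RAM.
mmap_bytes = 14_970_735_570   # from the RuntimeError message
ram_bytes = 15 * 10**9        # the VM's 15 GB of memory

print(f"checkpoint: {mmap_bytes / 2**30:.1f} GiB")  # ~13.9 GiB
print(f"VM RAM:     {ram_bytes / 2**30:.1f} GiB")   # ~14.0 GiB
```

So there is essentially no headroom left for the OS, Python, and the framework itself, which matches the failure happening on the CPU side.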

Loading seems to hit the CPU bottleneck before the model is actually placed on the GPU (GPU memory usage monitored with `nvidia-smi`).

I have come across the Accelerate utilities and Big Modeling docs, so I am working through debugging `device_map` and `max_memory` to make this work.
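A minimal sketch of the direction I'm debugging, assuming the standard `transformers`/`accelerate` loading API; the memory caps and the `offload` folder name are guesses for this VM, not verified values:

```python
# Assumed caps with headroom below each device's physical limit.
MAX_MEMORY = {0: "20GiB", "cpu": "10GiB"}

def load_mt5_xl(offload_folder="offload"):
    import torch
    from transformers import AutoModel

    # device_map="auto" asks Accelerate to place each layer on GPU 0,
    # CPU, or the offload folder on disk, respecting MAX_MEMORY.
    return AutoModel.from_pretrained(
        "google/mt5-xl",
        device_map="auto",
        max_memory=MAX_MEMORY,
        offload_folder=offload_folder,
        torch_dtype=torch.float16,  # halves the load-time footprint
    )
```

The idea is to stop the loader from materializing the full state dict in the 15 GB of RAM and let Accelerate spill what doesn't fit to disk instead.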

But if there are other solutions or workarounds available, they would be much appreciated.

Setting `low_cpu_mem_usage=True` in `from_pretrained` did not make a difference.

With `offload_state_dict=True`, about 25,000 MB got loaded onto the GPU before I hit the same error.
