Optimizing Model Loading with a CPU Bottleneck

I am trying to load the pretrained mt5-xl model on a GCP VM with 4 vCPUs, 15 GB of RAM, and an NVIDIA Tesla L4 GPU with 24 GB of GPU memory.

Code-wise it is simply:

model = AutoModel.from_pretrained("google/mt5-xl")

Model loading fails with RuntimeError: unable to mmap 14970735570 bytes from file, both on CPU-only VMs and on GPU VMs, which indicates this is a model-loading error rather than a GPU out-of-memory error.

Model loading seems to hit the CPU bottleneck before the model is actually put on the GPU (GPU memory usage was monitored with nvidia-smi).

I have come across the Accelerate utilities and Big Modeling, so I am working through debugging device_map and max_memory to make this work.
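For reference, this is roughly what I am experimenting with (a minimal sketch; the max_memory budgets and the offload folder name are illustrative placeholders, not values I have validated):

from transformers import AutoModel

# Let Accelerate spread the weights across GPU, CPU RAM, and disk.
# The memory budgets below are rough guesses for a 24 GB L4 / 15 GB RAM VM.
model = AutoModel.from_pretrained(
    "google/mt5-xl",
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "10GiB"},
    offload_folder="offload",  # anything that does not fit is spilled to disk here
)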

If there are other solutions or workarounds available, they would be much appreciated.

Setting low_cpu_mem_usage=True in `from_pretrained` did not make a difference.

With offload_state_dict=True, about 25,000 MB got loaded to the GPU before I hit the same error.


Accelerate Utilities

I think that’s the correct workaround. If you want to treat VRAM, RAM, and disk as a single pool, you should use the Accelerate library. :sweat_smile:

Environments where RAM is smaller than VRAM are not commonly expected, so this issue is not often reported, but excessive RAM consumption during model loading can occasionally become a problem.
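As a rough sketch of the Accelerate big-modeling approach (the checkpoint path and offload folder below are placeholders; point them at wherever the downloaded weights actually live):

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Build the model skeleton without allocating real weights in RAM.
config = AutoConfig.from_pretrained("google/mt5-xl")
with init_empty_weights():
    model = AutoModelForSeq2SeqLM.from_config(config)

# Stream the checkpoint into the skeleton, splitting it across GPU, RAM, and disk.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/mt5-xl-weights",  # local directory containing the checkpoint
    device_map="auto",
    offload_folder="offload",
)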

google/mt5-xl

Sorry, that’s the general rule, but in this case it seems to be a problem with the model itself: the 15 GB checkpoint is stored as a single file without being split… :sweat_smile:

Recently, checkpoints are usually saved in sharded form, which is more convenient when loading large models. The quickest solution would be to re-save the model yourself. You can either upload it to the Hub or store it somewhere in advance and then load it onto the GPU.

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-xl")
model.save_pretrained("mt5-xl-sft-2gb", safe_serialization=True, max_shard_size="2GB")  # or "5GB", "10GB", etc.
# model.push_to_hub("mt5-xl-sft-2gb", safe_serialization=True, max_shard_size="2GB")  # if uploading to the Hugging Face Hub directly
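Once the re-saved, sharded copy exists, reloading it should no longer require one huge mmap (the path below is simply whatever name you saved it under, or your own Hub repo if you pushed it):

from transformers import AutoModelForSeq2SeqLM

# Load from the local sharded copy; device_map="auto" lets Accelerate place the shards.
model = AutoModelForSeq2SeqLM.from_pretrained("mt5-xl-sft-2gb", device_map="auto")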

Thanks. I realized that was the problem.
At that point I didn’t have infra that could take care of sharding it either.

I have now been able to get access to some L40s, which circumvents this problem and should help going forward.

I will probably work through sharding it as you suggested and keep that as my own copy of the model.

I was also hoping to check with the maintainers about actually adding the safetensors weights (there is still an open PR for that) and then the sharded version too.


I was also hoping to check with the maintainers about actually adding the safetensors weights (there is still an open PR for that) and then the sharded version too.

It seems a PR has already been opened for that. All that’s needed is for the maintainer to merge it, but I guess it’s been forgotten… :sweat_smile: