Optimizing Model Loading with a CPU Bottleneck

I am trying to load the pretrained mt5-xl model on a GCP VM with 4 vCPUs, 15 GB of memory, and an NVIDIA Tesla L4 GPU with 24 GB of GPU memory.

Code-wise it's simply:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/mt5-xl")
```

Model loading fails with `RuntimeError: unable to mmap 14970735570 bytes from file` on both CPU-only and GPU VMs, which indicates a model-loading error rather than a GPU OOM.
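As a quick sanity check (plain arithmetic, no ML libraries; the byte count is copied from the error message), the checkpoint is almost exactly the size of the VM's RAM:

```python
# Compare the size from_pretrained tried to mmap with the VM's RAM.
mmap_bytes = 14_970_735_570   # from the RuntimeError message
ram_bytes = 15 * 10**9        # the VM's 15 GB of memory

print(f"checkpoint: {mmap_bytes / 2**30:.1f} GiB")  # ~13.9 GiB
print(f"VM RAM:     {ram_bytes / 2**30:.1f} GiB")   # ~14.0 GiB
```

So there is essentially no headroom left for the OS, Python, and the framework itself, which matches the failure happening on the CPU side.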

Loading seems to hit the CPU bottleneck before the model is actually placed on the GPU (GPU memory usage monitored with `nvidia-smi`).

I have come across the Accelerate utilities and Big Modeling docs, so I am working through debugging `device_map` and `max_memory` to make this work.
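A minimal sketch of the direction I'm debugging, assuming the standard `transformers`/`accelerate` loading API; the memory caps and the `offload` folder name are guesses for this VM, not verified values:

```python
# Assumed caps with headroom below each device's physical limit.
MAX_MEMORY = {0: "20GiB", "cpu": "10GiB"}

def load_mt5_xl(offload_folder="offload"):
    import torch
    from transformers import AutoModel

    # device_map="auto" asks Accelerate to place each layer on GPU 0,
    # CPU, or the offload folder on disk, respecting MAX_MEMORY.
    return AutoModel.from_pretrained(
        "google/mt5-xl",
        device_map="auto",
        max_memory=MAX_MEMORY,
        offload_folder=offload_folder,
        torch_dtype=torch.float16,  # halves the load-time footprint
    )
```

The idea is to stop the loader from materializing the full state dict in the 15 GB of RAM and let Accelerate spill what doesn't fit to disk instead.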

But if there are other solutions or workarounds available, they would be much appreciated.

Setting `low_cpu_mem_usage=True` in `from_pretrained` did not make a difference.

With `offload_state_dict=True`, about 25,000 MB got loaded onto the GPU before I hit the same error.
