Optimizing Model Loading with a CPU Bottleneck

I am trying to load the pretrained mt5-xl model on a GCP VM with 4 vCPUs, 15 GB of RAM, and an NVIDIA Tesla L4 GPU with 24 GB of GPU memory.

Code-wise it is simply:

model = AutoModel.from_pretrained("google/mt5-xl")

Model loading fails with RuntimeError: unable to mmap 14970735570 bytes from file, both on CPU-only VMs and on GPU VMs, which indicates this is a model-loading error rather than a GPU out-of-memory error.

Model loading seems to hit the CPU bottleneck before the model is actually put on the GPU (GPU memory usage was monitored with nvidia-smi).

I have come across the Accelerate utilities and Big Modeling, so I am working through debugging device_map and max_memory to make this work.
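For reference, this is roughly what I am experimenting with (a minimal sketch; the max_memory budgets and the offload folder name are illustrative placeholders, not values I have validated):

from transformers import AutoModel

# Let Accelerate spread the weights across GPU, CPU RAM, and disk.
# The memory budgets below are rough guesses for a 24 GB L4 / 15 GB RAM VM.
model = AutoModel.from_pretrained(
    "google/mt5-xl",
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "10GiB"},
    offload_folder="offload",  # anything that does not fit is spilled to disk here
)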

If there are other solutions or workarounds available, they would be much appreciated.

Setting low_cpu_mem_usage=True in `from_pretrained` did not make a difference.

With offload_state_dict=True, about 25,000 MB got loaded to the GPU before I hit the same error.


Accelerate Utilities

I think that’s the correct workaround. If you want to treat VRAM, RAM, and disk as a single pool, you should use the Accelerate library. :sweat_smile:

Environments where RAM is smaller than VRAM are not commonly expected, so this issue is not often reported, but excessive RAM consumption during model loading can occasionally become a problem.
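As a rough sketch of the Accelerate big-modeling approach (the checkpoint path and offload folder below are placeholders; point them at wherever the downloaded weights actually live):

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Build the model skeleton without allocating real weights in RAM.
config = AutoConfig.from_pretrained("google/mt5-xl")
with init_empty_weights():
    model = AutoModelForSeq2SeqLM.from_config(config)

# Stream the checkpoint into the skeleton, splitting it across GPU, RAM, and disk.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/mt5-xl-weights",  # local directory containing the checkpoint
    device_map="auto",
    offload_folder="offload",
)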

google/mt5-xl

Sorry, that’s the general rule, but in this case it seems to be a problem with the model itself: the 15 GB checkpoint is stored as a single file without being split… :sweat_smile:

Recently, checkpoints are usually saved in sharded form, which is more convenient when loading large models. The quickest solution would be to re-save the model yourself. You can either upload it to the Hub or store it somewhere in advance and then load it onto the GPU.

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-xl")
model.save_pretrained("mt5-xl-sft-2gb", safe_serialization=True, max_shard_size="2GB")  # or "5GB", "10GB", etc.
# model.push_to_hub("mt5-xl-sft-2gb", safe_serialization=True, max_shard_size="2GB")  # if uploading to the Hugging Face Hub directly
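Once the re-saved, sharded copy exists, reloading it should no longer require one huge mmap (the path below is simply whatever name you saved it under, or your own Hub repo if you pushed it):

from transformers import AutoModelForSeq2SeqLM

# Load from the local sharded copy; device_map="auto" lets Accelerate place the shards.
model = AutoModelForSeq2SeqLM.from_pretrained("mt5-xl-sft-2gb", device_map="auto")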

Thanks. I realized that was the problem.
At that point I didn’t have infra that could take care of sharding it either.

I have now been able to get access to some L40s, which circumvents this problem and should help going forward.

I will probably work through sharding it as you suggested and keep that as my own copy of the model.

I was also hoping to check with the maintainers about actually adding the safetensors weights (there is still an open PR for that) and then the sharded version too.


I was also hoping to check with the maintainers about actually adding the safetensors weights (there is still an open PR for that) and then the sharded version too.

It seems a PR has already been opened for that. All that’s needed is for the maintainer to merge it, but I guess it’s been forgotten… :sweat_smile: