google/mt5-xl
Sorry, that's the general rule, but in this case the problem lies with the model repo itself: the 15 GB checkpoint is stored as a single unsharded file. These days, weights are usually saved in sharded form, which makes loading large models much easier. The quickest fix is to re-save the model yourself with sharding enabled, then either push it to the Hub or keep it somewhere locally and load it onto the GPU from there.
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-xl")
model.save_pretrained("mt5-xl-sft-2gb", safe_serialization=True, max_shard_size="2GB") # or "5GB", "10GB", etc.
#model.push_to_hub("mt5-xl-sft-2gb", safe_serialization=True, max_shard_size="2GB") # if uploading to Hugging Face Hub directly
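For intuition, `max_shard_size` works by greedily packing weights into shard files until adding the next weight would exceed the limit, and writing an index file (`model.safetensors.index.json`) that maps each weight name to its shard. The sketch below is a simplified, hypothetical version of that logic (the function and names are illustrative, not the `transformers` internals):

```python
def shard_state_dict(weight_sizes, max_shard_bytes):
    """Greedily group weights into shards of at most max_shard_bytes each.

    weight_sizes: dict mapping weight name -> size in bytes.
    A single weight larger than the limit still gets its own shard.
    """
    shards = []                      # list of lists of weight names
    current, current_size = [], 0
    for name, size in weight_sizes.items():
        # Start a new shard if this weight would push us over the limit.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)

    # Index in the spirit of model.safetensors.index.json:
    # weight name -> shard file that contains it.
    index = {
        name: f"model-{i + 1:05d}-of-{len(shards):05d}.safetensors"
        for i, names in enumerate(shards)
        for name in names
    }
    return shards, index

# Example: weights of 1.2, 0.5, 1.8, and 0.3 (units of GB, scaled down)
# with a 2.0 limit pack into three shards.
sizes = {"a": 1_200, "b": 500, "c": 1_800, "d": 300}
shards, index = shard_state_dict(sizes, 2_000)
print(shards)        # [['a', 'b'], ['c'], ['d']]
print(index["c"])    # model-00002-of-00003.safetensors
```

When loading, `from_pretrained` reads the index and fetches only the shard files, which is also why sharded checkpoints resume better after a failed download: a 2 GB shard is retried instead of the whole 15 GB file.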