CUDA OOM when exporting a large model to ONNX

in-certo · February 17, 2023, 3:08am

I got an OOM error when exporting a large model to ONNX. I wonder how Optimum handles this issue.

Here are my settings:

The code works for a smaller model with fewer parameters, so the error is due to the model size.
I cannot export the model on the CPU because it is in fp16.
The model alone takes 20 GB of CUDA memory, and the total GPU capacity is 80 GB. (Exporting seems to consume about 10x the model's memory: a small 2 GB model needed 20 GB.)
Running multiple forward passes before the export does not cause any trouble.
A greedy search is implemented in the graph to generate 32 tokens, so a lot of intermediate past_key_values are cached.
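To get a feel for how much the cached past_key_values could add up to, here is a rough back-of-the-envelope sketch. The helper names and the model dimensions (32 layers, 32 heads, head size 128, a 512-token prompt) are hypothetical and not taken from this thread; only fp16 (2 bytes per element) and the 32 generated tokens come from the post. It assumes the worst case where the cache from every decoding step stays alive during export, which may not be exactly what happens.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of one full KV cache: a key and a value tensor per layer,
    each shaped [batch, heads, seq_len, head_dim]. fp16 = 2 bytes."""
    return 2 * num_layers * batch_size * num_heads * seq_len * head_dim * bytes_per_elem


def unrolled_cache_bytes(prompt_len, new_tokens, **dims):
    """Worst case (assumption): the cache produced at every decoding
    step is kept alive, so the per-step sizes add up."""
    return sum(
        kv_cache_bytes(seq_len=prompt_len + step, **dims)
        for step in range(1, new_tokens + 1)
    )


# Hypothetical model dimensions, for illustration only.
dims = dict(num_layers=32, num_heads=32, head_dim=128)

final_cache = kv_cache_bytes(seq_len=512 + 32, **dims)
all_steps = unrolled_cache_bytes(prompt_len=512, new_tokens=32, **dims)

print(f"cache after the last step: {final_cache / 2**30:.2f} GiB")
print(f"all 32 step caches kept:   {all_steps / 2**30:.2f} GiB")
```

Under these assumptions the caches of all steps together are roughly 30 times larger than the single cache a normal forward pass would hold at the end, which is in line with forward passes being fine while the export runs out of memory.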

in-certo · February 17, 2023, 8:34am

Very odd. It seems that the time and memory consumed to export a jit.ScriptModule are proportional to the loop size.

If this is true, it seems impossible to export a model with a decoding method into the ONNX computation graph.

import torch.nn as nn

# The two models are identical except for the loop count (2 vs. 32).
class Model2(nn.Module):
    def forward(self, x):
        for i in range(2):
            x *= x
        return x

class Model32(nn.Module):
    def forward(self, x):
        for i in range(32):
            x *= x
        return x

fxmarty · February 17, 2023, 9:27am

Thanks @in-certo , could it be linked to this issue? `torch.jit.trace` memory usage increase although forward is constant, and gets much slower than forward with model depth increase · Issue #93943 · pytorch/pytorch · GitHub

I have also seen memory usage increase with the number of loop iterations when using torch.jit.trace with Stable Diffusion.
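The growth with loop count can be observed directly on a toy model. The following is a minimal sketch, assuming a parameterised variant of the Model2/Model32 classes above (the names LoopModel and count_mul_nodes are made up for this example). It counts the multiplication nodes that torch.jit.trace records, since tracing runs forward() once and records each executed op instead of keeping the Python loop.

```python
import torch
import torch.nn as nn


class LoopModel(nn.Module):
    """Same body as Model2/Model32, with the loop count as a parameter."""

    def __init__(self, steps: int):
        super().__init__()
        self.steps = steps

    def forward(self, x):
        for _ in range(self.steps):
            x = x * x  # out-of-place, so the example input is not mutated
        return x


def count_mul_nodes(steps: int) -> int:
    # Tracing executes forward() once and records every tensor op it sees,
    # so the Python loop is unrolled into one node per iteration.
    traced = torch.jit.trace(LoopModel(steps), torch.ones(2))
    return sum(1 for node in traced.graph.nodes() if node.kind() == "aten::mul")


for steps in (2, 32):
    print(steps, "iterations ->", count_mul_nodes(steps), "aten::mul nodes")
```

With a real decoder each iteration is a full forward pass rather than a single multiplication, so the traced graph (and whatever the exporter keeps per node) grows accordingly.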

in-certo · February 17, 2023, 9:33am

Yes, exactly.

Here is another related issue: ONNX model file exported from Transformer decoder is too large · Issue #4319 · onnx/onnx (github.com)

implank · June 26, 2025, 1:02pm

I tried to convert a PyTorch model to ONNX but hit an OOM error, even though running inference directly works. After adding 'with torch.reference_made()' [sic: torch.inference_mode()] before 'torch.onnx.export', I was able to export the model to ONNX without the OOM.
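One plausible reason this helps: with gradients enabled, a forward pass over parameters that require grad keeps an autograd graph (and the activations it references) alive, whereas torch.inference_mode() records nothing. The small sketch below only illustrates that difference on a toy nn.Linear layer; it is not the poster's code, and the variable names are made up.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)  # parameters have requires_grad=True by default
x = torch.randn(1, 4)

# Default mode: the output is attached to an autograd graph.
y_grad = layer(x)

# Inference mode: no graph is recorded and the result is an inference tensor.
with torch.inference_mode():
    y_inf = layer(x)
    # An export call such as torch.onnx.export(model, (x,), "model.onnx")
    # would be placed inside this same `with` block.

print("default mode:  ", y_grad.requires_grad, y_grad.grad_fn is not None)
print("inference mode:", y_inf.requires_grad, y_inf.grad_fn is not None)
```

torch.no_grad() disables gradient tracking in a similar way and may be worth trying if inference mode causes problems with a particular exporter version.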

John6666 · June 26, 2025, 1:08pm

For future reference…

As of today, if you want to convert a model to ONNX, using the actively maintained Optimum library seems to be the generally recommended route.
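For reference, a minimal sketch of what an Optimum export can look like from Python, using main_export from optimum.exporters.onnx. The wrapper function name, the model id and the output directory are placeholders chosen for this example, and the right task depends on the model; larger or custom architectures may need extra arguments.

```python
def export_with_optimum(model_id: str, output_dir: str) -> None:
    """Export a Hub model (or local checkpoint directory) to ONNX via Optimum."""
    # Imported lazily so that defining this helper does not require Optimum.
    from optimum.exporters.onnx import main_export

    main_export(
        model_id,
        output=output_dir,
        task="text-generation-with-past",  # decoder export with KV cache inputs
        device="cpu",
    )


# Example call (placeholder model id, downloads the model if not cached):
# export_with_optimum("gpt2", "gpt2_onnx")
```

Note that this exports a single decoding step with past key values as inputs and outputs; the generation loop then runs outside the graph, which avoids unrolling it during export.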

surfiniaburger · July 26, 2025, 2:25pm

I tried it but still got an OOM error:

# Imports this excerpt needs. pytorch_model_path, onnx_output_path,
# CustomGemma3NMultimodalOnnxConfig and get_submodels are defined
# elsewhere in the script (not shown).
import os
import traceback
from pathlib import Path

import torch
from optimum.exporters.onnx import onnx_export_from_model
from transformers import AutoConfig, AutoModelForCausalLM

print("\nStep 2: Preparing and running the ONNX export...")
try:
    # The `zsh: killed` error is a classic Out-Of-Memory (OOM) error from the OS.
    # Your insight about `torch.inference_mode()` (you mentioned `torch.reference_made()`,

    # by disabling all gradient calculations during loading and exporting.
    with torch.inference_mode():
        # --- Pre-load check and fix for empty index files ---
        index_path = Path(pytorch_model_path) / "model.safetensors.index.json"
        if index_path.exists() and index_path.stat().st_size == 0:
            print(f"⚠️  Found an empty index file at: {index_path}")
            print("   This can cause loading errors. Removing it to proceed.")
            os.remove(index_path)
            print("   ✅ Empty index file removed.")

        # --- MPS DEBUGGING ---
        # Forcing CPU to bypass any MPS-specific bugs.
        device = "cpu"
        print(f"Using device: {device}")

        # Load the model and config ONCE to have better control over memory.
        print("Loading model and config from disk...")
        main_config = AutoConfig.from_pretrained(pytorch_model_path, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            pytorch_model_path,
            config=main_config,
            trust_remote_code=True,
        ).to(device)
        print("Model loaded.")

        custom_onnx_configs = {
            "decoder_model": CustomGemma3NMultimodalOnnxConfig(config=main_config, task="text-generation", use_past=False),
            "decoder_with_past_model": CustomGemma3NMultimodalOnnxConfig(config=main_config, task="text-generation", use_past=True),
        }
        # Use the more direct `onnx_export_from_model` which takes a pre-loaded model object.
        onnx_export_from_model(
            model=model,
            output=Path(onnx_output_path),
            task="text-generation-with-past",
            custom_onnx_configs=custom_onnx_configs,
            fn_get_submodels=get_submodels,
            opset=14,
            do_validation=False,
            device=device,
        )
        print("\n✅ ONNX conversion process completed successfully!")
        print(f"   The exported model is saved in: {Path(onnx_output_path).resolve()}")

except Exception:
    print("\n❌ An error occurred during the ONNX conversion process.")
    print("--- FULL TRACEBACK ---")
    traceback.print_exc()
    print("--- END OF TRACEBACK ---")
