For Transformers.js:
Use main_export() with custom_onnx_configs and with_behavior(..., use_past=True) to get the trio. Do not monkey-patch.
Background and context
- Why a “trio”: seq2seq generation needs a one-off decoder for the first token and a decoder_with_past for subsequent tokens so the KV cache is reused; see the runtime sketch after this list. This is the supported pattern. (Hugging Face Forums)
- Where to set it: Optimum’s exporter lets you pass custom_onnx_configs to main_export() and choose behaviors per subgraph: "encoder", "decoder", and "decoder with past". You can also disable post-processing so files are kept separate. (Hugging Face)
- Transformers.js expects this layout. Public web-ready repos ship onnx/{encoder_model.onnx, decoder_model.onnx, decoder_with_past_model.onnx} or a merged decoder. (Hugging Face)
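To make the first-token vs. subsequent-token split concrete, here is a minimal greedy-decoding sketch over the three files using ONNX Runtime in Python. It assumes the usual Optimum tensor names for a VisionEncoderDecoder export (pixel_values, encoder_hidden_states, input_ids, logits, present.* / past_key_values.*) and GPT-2-style start/end token ids; check session.get_inputs() and your generation config for your model.

```python
import numpy as np
import onnxruntime as ort

enc = ort.InferenceSession("model/trio_onnx/encoder_model.onnx")
dec = ort.InferenceSession("model/trio_onnx/decoder_model.onnx")
dec_past = ort.InferenceSession("model/trio_onnx/decoder_with_past_model.onnx")

# Run the vision encoder once per image (random tensor as a stand-in here).
pixel_values = np.random.rand(1, 3, 224, 224).astype(np.float32)
encoder_hidden_states = enc.run(None, {"pixel_values": pixel_values})[0]

bos_id, eos_id = 50256, 50256  # GPT-2-style ids; check your generation_config
dec_out_names = [o.name for o in dec.get_outputs()]        # e.g. ["logits", "present.0.decoder.key", ...]
past_in_names = [i.name for i in dec_past.get_inputs()]
past_out_names = [o.name for o in dec_past.get_outputs()]

# First token: the plain decoder, which also returns the initial KV cache ("present.*").
outs = dec.run(None, {
    "input_ids": np.array([[bos_id]], dtype=np.int64),
    "encoder_hidden_states": encoder_hidden_states,
})
logits, cache = outs[0], dict(zip(dec_out_names[1:], outs[1:]))
tokens = [int(logits[0, -1].argmax())]

# Subsequent tokens: decoder_with_past, fed only the last token plus the cache.
for _ in range(30):
    if tokens[-1] == eos_id:
        break
    feed = {"input_ids": np.array([[tokens[-1]]], dtype=np.int64)}
    for name in past_in_names:
        if name.startswith("past_key_values"):
            feed[name] = cache[name.replace("past_key_values", "present")]
        elif name == "encoder_hidden_states":
            feed[name] = encoder_hidden_states
    outs = dec_past.run(None, feed)
    logits = outs[0]
    # Keep cross-attention KV from the first call if this graph does not re-emit them.
    cache.update(dict(zip(past_out_names[1:], outs[1:])))
    tokens.append(int(logits[0, -1].argmax()))

print(tokens)
```

This mirrors what Transformers.js does internally: one call to decoder_model.onnx to seed the cache, then repeated calls to decoder_with_past_model.onnx that reuse it.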
Minimal, correct export (no patches)
```python
# refs:
# - Export guide (custom_onnx_configs + with_behavior + no_post_process):
#   https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model
# - main_export reference:
#   https://huggingface.co/docs/optimum-onnx/en/onnx/package_reference/export
from pathlib import Path

from transformers import AutoConfig
from optimum.exporters.onnx import main_export
from optimum.exporters.tasks import TasksManager

model_dir = "./model"  # your VisionEncoderDecoder checkpoint
out = Path("./model/trio_onnx")
out.mkdir(parents=True, exist_ok=True)

# Build an ONNX config for your model + task
cfg = AutoConfig.from_pretrained(model_dir)
ctor = TasksManager.get_exporter_config_constructor(
    exporter="onnx",
    model_type=cfg.model_type,
    task="image-to-text",  # vision→text task
)
onnx_cfg = ctor(config=cfg, task="image-to-text")

# Ask explicitly for the three subgraphs
custom_onnx_configs = {
    "encoder_model": onnx_cfg.with_behavior("encoder"),
    "decoder_model": onnx_cfg.with_behavior("decoder", use_past=False),
    "decoder_with_past_model": onnx_cfg.with_behavior("decoder", use_past=True),
}

# Export. Keep the trio separate (avoid the automatic merge).
main_export(
    model_dir,  # model name or path (first positional argument)
    output=str(out),
    task="image-to-text",
    custom_onnx_configs=custom_onnx_configs,
    no_post_process=True,
)
```
Why this works: Optimum documents custom_onnx_configs and with_behavior("decoder", use_past=True) to emit decoder_with_past_model.onnx; no_post_process=True prevents the exporter from merging decoders. (Hugging Face)
Verify and align with Transformers.js
- Check that the output folder contains exactly encoder_model.onnx, decoder_model.onnx, and decoder_with_past_model.onnx; a quick check is sketched after this list. This mirrors working web repos. (Hugging Face)
- Use that folder structure in your web model repo. Xenova’s captioner card recommends this layout for browser use. (Hugging Face)
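A quick way to check, as a sketch that reuses the ./model/trio_onnx output path from the export snippet above:

```python
from pathlib import Path

out = Path("./model/trio_onnx")
expected = ["encoder_model.onnx", "decoder_model.onnx", "decoder_with_past_model.onnx"]
missing = [name for name in expected if not (out / name).exists()]
print("missing:", missing or "none")
```

If you plan to publish for Transformers.js, place these files under an onnx/ subfolder of the model repo, matching the layout above.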
Common failure modes and fixes
- Only two files produced: you didn’t request the with-past behavior. Add the custom_onnx_configs dict as above. (Hugging Face)
- Decoder files merged: remove the merge by setting no_post_process=True. The doc names this exact flag. (Hugging Face)
- Unsure which tasks your model supports: query TasksManager.get_supported_tasks_for_model_type(model_type, "onnx") and pick the vision→text task; a sketch follows this list. The export guide shows this workflow. (Hugging Face)
- Why two decoders at all: first-token vs. subsequent-token calls. The author of Transformers.js explains the duplication and runtime need. (Hugging Face Forums)
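A minimal task lookup might look like this (assuming the checkpoint lives at ./model as above; newer Optimum versions may also accept a library_name argument):

```python
from transformers import AutoConfig
from optimum.exporters.tasks import TasksManager

cfg = AutoConfig.from_pretrained("./model")
# Maps each exportable task name to its ONNX config constructor.
supported = TasksManager.get_supported_tasks_for_model_type(cfg.model_type, "onnx")
print(sorted(supported))  # for a VisionEncoderDecoder checkpoint this should include "image-to-text"
```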
Optional: merged decoder
Some exporters can produce a single decoder_model_merged.onnx that handles both first and subsequent tokens. If you prefer that, omit no_post_process=True. The public ViT-GPT2 repo shows merged and split variants side by side. (Hugging Face)
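As a sketch, the merged variant is the same call without the flag, reusing model_dir and custom_onnx_configs from the export snippet above (the output path here is only illustrative):

```python
from optimum.exporters.onnx import main_export

main_export(
    model_dir,
    output="./model/merged_onnx",
    task="image-to-text",
    custom_onnx_configs=custom_onnx_configs,
    # no_post_process left at its default so the exporter's post-processing can
    # fold decoder + decoder_with_past into decoder_model_merged.onnx.
)
```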