The returned outputs.cross_attentions is a tuple of None values instead of the expected attention tensors.
I’ve confirmed that the output structure of model.generate() is ['sequences', 'decoder_attentions', 'cross_attentions', 'past_key_values'].
I’ve also confirmed that the model works correctly in PyTorch prior to ONNX conversion. Am I missing a configuration step needed for generate() to operate as expected? Or am I incorrectly assuming that generate() is supported for image-to-text-with-past models? Any help is appreciated. Thanks
Hi there!
It sounds like you’re encountering an issue where the exported ONNX model’s generate() call returns cross_attentions as a tuple of None values, even though your original PyTorch model produced proper tensors.
A few points to consider:
Exporting Attention Outputs:
When using onnx_export_from_model, it isn’t guaranteed that all intermediate attention outputs (such as cross attentions) are exported by default. ONNX export utilities often optimize the graph and “prune” outputs that aren’t essential to the final generated sequence, which would explain why the output structure still contains the key "cross_attentions" while the exported graph never produces those values. Check whether there’s an export flag or configuration setting that forces retention of all attention outputs; some libraries offer a “debug” or “full output” mode that prevents pruning.
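For reference, the Optimum docs show this pattern for Whisper: subclass the model’s OnnxConfig, declare the attention tensors as extra graph outputs, and pass the custom configs together with model_kwargs={"output_attentions": True} to the export call (onnx_export_from_model should accept the same custom_onnx_configs and model_kwargs arguments as main_export). A condensed sketch, with WhisperOnnxConfig standing in for whichever config class Optimum uses for your image-to-text model:

```python
from typing import Dict

from transformers import AutoConfig
from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.base import ConfigBehavior
from optimum.exporters.onnx.model_configs import WhisperOnnxConfig


class OnnxConfigWithCrossAttentions(WhisperOnnxConfig):
    @property
    def outputs(self) -> Dict[str, Dict[int, str]]:
        common_outputs = super().outputs
        if self._behavior is ConfigBehavior.DECODER:
            # Declare one extra output per decoder layer so the exporter
            # keeps the cross-attention tensors instead of pruning them.
            for i in range(self._config.decoder_layers):
                common_outputs[f"cross_attentions.{i}"] = {
                    0: "batch_size",
                    2: "decoder_sequence_length",
                    3: "encoder_sequence_length_out",
                }
        return common_outputs


model_id = "openai/whisper-tiny.en"
onnx_config = OnnxConfigWithCrossAttentions(
    config=AutoConfig.from_pretrained(model_id),
    task="automatic-speech-recognition",
)

custom_onnx_configs = {
    "encoder_model": onnx_config.with_behavior("encoder"),
    "decoder_model": onnx_config.with_behavior("decoder", use_past=False),
    "decoder_with_past_model": onnx_config.with_behavior("decoder", use_past=True),
}

main_export(
    model_id,
    output="whisper_onnx_with_attentions",
    no_post_process=True,
    # Both pieces matter: model_kwargs makes the model return attentions
    # during tracing, and the custom config declares them as graph outputs.
    model_kwargs={"output_attentions": True},
    custom_onnx_configs=custom_onnx_configs,
)
```

Note that in this pattern model_kwargs={"output_attentions": True} alone is not enough; without the custom config declaring the extra outputs, they are pruned from the graph.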
Model Configuration for Generate:
It’s worth verifying whether your ONNX model configuration (and the ORTModelForVision2Seq wrapper) exposes any extra parameters for returning attentions. With Hugging Face Transformers in PyTorch, you would pass output_attentions=True or set it on the model’s config object. You’re already passing output_attentions=True in the generate() call, but check whether that flag is actually honored by the ONNX Runtime version of generate; the ONNX export path does not always support all optional outputs, especially for less common tasks like image-to-text-with-past.
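As a quick sanity check (the path and input shape below are placeholders for your own export), you can set the flag in both places and see whether the ORT path honors it:

```python
import torch
from optimum.onnxruntime import ORTModelForVision2Seq

model = ORTModelForVision2Seq.from_pretrained("./onnx_model")  # placeholder path
model.config.output_attentions = True  # belt and braces; may be ignored by the ORT path

pixel_values = torch.randn(1, 3, 384, 384)  # stand-in input; shape depends on your model

outputs = model.generate(
    pixel_values,
    output_attentions=True,
    return_dict_in_generate=True,
)
print(outputs.cross_attentions)  # a tuple of None values reproduces the issue
```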
Support for “image-to-text-with-past” in ONNX:
You mentioned that you confirmed the PyTorch model works correctly before export. It’s possible that the current ONNX conversion (or the ORTModelForVision2Seq implementation) does not yet support generation with all the extra outputs, such as cross attentions, for models that use past key values. In other words, generate() may only be supported in a “basic” mode that returns the sequences and decoder attentions, with cross attentions omitted. Check the documentation and GitHub issues for Optimum and the ORTModelForVision2Seq class to see whether this is a known limitation.
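For comparison, a minimal PyTorch baseline (assuming a VisionEncoderDecoder checkpoint such as TrOCR) shows what the ONNX path should be returning:

```python
import torch
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-handwritten")
pixel_values = torch.randn(1, 3, 384, 384)  # stand-in for a real preprocessed image

out = model.generate(
    pixel_values,
    output_attentions=True,
    return_dict_in_generate=True,
    max_new_tokens=4,
)
# One tuple per generated step, each holding one tensor per decoder layer:
print(type(out.cross_attentions[0][0]))  # <class 'torch.Tensor'>
```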
Workarounds:
If your primary goal is the generated text itself, you can simply ignore cross_attentions.
If you require the attention maps for analysis, you might try modifying the export configuration or manually editing the ONNX graph to retain these intermediate outputs (see the sketch after this list).
Alternatively, consider raising an issue in the Optimum repository if this behavior isn’t documented—there may be ongoing work to improve support for such outputs.
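On the manual-editing route, a hypothetical sketch with the onnx package might look like the following; the internal tensor name is made up, and you would first look up the real name of the cross-attention weights in Netron:

```python
import onnx
from onnx import TensorProto, helper

m = onnx.load("decoder_model.onnx")

# Expose an internal tensor as an extra graph output.
attn = helper.make_tensor_value_info(
    "cross_attn_weights_layer0",  # hypothetical name; find the real one in Netron
    TensorProto.FLOAT,
    None,  # leave the shape unspecified
)
m.graph.output.append(attn)

onnx.save(m, "decoder_model_with_attn.onnx")
```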
In summary, it appears that either a configuration flag is missing to force ONNX export to include cross_attentions, or the ONNX runtime implementation for generate in these image-to-text-with-past models currently does not support returning cross attentions. I’d recommend checking for any export options that preserve intermediate outputs and, if none exist, filing an issue with the maintainers.
Hope this helps, and thanks for sharing your experience!
@Alanturner2 Thanks for the insightful post. I was able to confirm via Netron that the ONNX decoder is only returning past key values and logits, not the attentions. Passing a simple model_kwargs={"output_attentions": True} in the export call doesn’t change anything. I’m assuming it’s not quite that simple and may require a custom config.
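For anyone following along, this is the programmatic equivalent of the Netron check (the file name and output names reflect my export and may differ for yours):

```python
import onnx

m = onnx.load("onnx_model/decoder_model.onnx")
print([o.name for o in m.graph.output])
# e.g. ['logits', 'present.0.decoder.key', 'present.0.decoder.value', ...]
# with no attention outputs listed
```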