The returned outputs.cross_attentions is a tuple of None values instead of the expected attention tensors.
I’ve confirmed that the output structure of model.generate() is ['sequences', 'decoder_attentions', 'cross_attentions', 'past_key_values'].
I’ve also confirmed that the model works correctly in PyTorch prior to ONNX conversion. Am I missing a configuration step needed for generate() to operate as expected? Or am I incorrectly assuming that generate() is supported for image-to-text-with-past models? Any help is appreciated. Thanks
Hi there!
It sounds like you’re encountering an issue where the exported ONNX model’s generate() call returns cross_attentions as a tuple of None values, even though your original PyTorch model produced proper tensors.
A few points to consider:
Exporting Attention Outputs:
When using onnx_export_from_model, it isn’t guaranteed that all intermediate attention outputs (such as cross attentions) are exported by default. ONNX export utilities often optimize the graph and “prune” outputs that aren’t essential to the final generated sequence, which would explain why the output structure still contains the key "cross_attentions" while the exported graph never produces those values. Check whether there’s an export flag or configuration setting that forces retention of all attention outputs; some libraries offer a “debug” or “full output” mode that prevents pruning.
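For reference, the Optimum docs show this pattern for Whisper: subclass the model’s OnnxConfig, declare the attention tensors as extra graph outputs, and pass the custom configs together with model_kwargs={"output_attentions": True} to the export call (onnx_export_from_model should accept the same custom_onnx_configs and model_kwargs arguments as main_export). A condensed sketch, with WhisperOnnxConfig standing in for whichever config class Optimum uses for your image-to-text model:

```python
from typing import Dict

from transformers import AutoConfig
from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.base import ConfigBehavior
from optimum.exporters.onnx.model_configs import WhisperOnnxConfig


class OnnxConfigWithCrossAttentions(WhisperOnnxConfig):
    @property
    def outputs(self) -> Dict[str, Dict[int, str]]:
        common_outputs = super().outputs
        if self._behavior is ConfigBehavior.DECODER:
            # Declare one extra output per decoder layer so the exporter
            # keeps the cross-attention tensors instead of pruning them.
            for i in range(self._config.decoder_layers):
                common_outputs[f"cross_attentions.{i}"] = {
                    0: "batch_size",
                    2: "decoder_sequence_length",
                    3: "encoder_sequence_length_out",
                }
        return common_outputs


model_id = "openai/whisper-tiny.en"
onnx_config = OnnxConfigWithCrossAttentions(
    config=AutoConfig.from_pretrained(model_id),
    task="automatic-speech-recognition",
)

custom_onnx_configs = {
    "encoder_model": onnx_config.with_behavior("encoder"),
    "decoder_model": onnx_config.with_behavior("decoder", use_past=False),
    "decoder_with_past_model": onnx_config.with_behavior("decoder", use_past=True),
}

main_export(
    model_id,
    output="whisper_onnx_with_attentions",
    no_post_process=True,
    # Both pieces matter: model_kwargs makes the model return attentions
    # during tracing, and the custom config declares them as graph outputs.
    model_kwargs={"output_attentions": True},
    custom_onnx_configs=custom_onnx_configs,
)
```

Note that in this pattern model_kwargs={"output_attentions": True} alone is not enough; without the custom config declaring the extra outputs, they are pruned from the graph.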
Model Configuration for Generate:
It’s worth verifying whether your ONNX model configuration (and the ORTModelForVision2Seq wrapper) exposes any extra parameters for returning attentions. With Hugging Face Transformers in PyTorch, you would pass output_attentions=True or set it on the model’s config object. You’re already passing output_attentions=True in the generate() call, but check whether that flag is actually honored by the ONNX Runtime version of generate; the ONNX export path does not always support all optional outputs, especially for less common tasks like image-to-text-with-past.
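As a quick sanity check (the path and input shape below are placeholders for your own export), you can set the flag in both places and see whether the ORT path honors it:

```python
import torch
from optimum.onnxruntime import ORTModelForVision2Seq

model = ORTModelForVision2Seq.from_pretrained("./onnx_model")  # placeholder path
model.config.output_attentions = True  # belt and braces; may be ignored by the ORT path

pixel_values = torch.randn(1, 3, 384, 384)  # stand-in input; shape depends on your model

outputs = model.generate(
    pixel_values,
    output_attentions=True,
    return_dict_in_generate=True,
)
print(outputs.cross_attentions)  # a tuple of None values reproduces the issue
```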
Support for “image-to-text-with-past” in ONNX:
You mentioned that you confirmed the PyTorch model works correctly before export. It’s possible that the current ONNX conversion (or the ORTModelForVision2Seq implementation) does not yet support generation with all the extra outputs, such as cross attentions, for models that use past key values. In other words, generate() may only be supported in a “basic” mode that returns the sequences and decoder attentions, with cross attentions omitted. Check the documentation and GitHub issues for Optimum and the ORTModelForVision2Seq class to see whether this is a known limitation.
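For comparison, a minimal PyTorch baseline (assuming a VisionEncoderDecoder checkpoint such as TrOCR) shows what the ONNX path should be returning:

```python
import torch
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-handwritten")
pixel_values = torch.randn(1, 3, 384, 384)  # stand-in for a real preprocessed image

out = model.generate(
    pixel_values,
    output_attentions=True,
    return_dict_in_generate=True,
    max_new_tokens=4,
)
# One tuple per generated step, each holding one tensor per decoder layer:
print(type(out.cross_attentions[0][0]))  # <class 'torch.Tensor'>
```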
Workarounds:
If your primary goal is the generated text itself, you can simply ignore cross_attentions.
If you require the attention maps for analysis, you might try modifying the export configuration or manually editing the ONNX graph to retain these intermediate outputs (see the sketch after this list).
Alternatively, consider raising an issue in the Optimum repository if this behavior isn’t documented—there may be ongoing work to improve support for such outputs.
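On the manual-editing route, a hypothetical sketch with the onnx package might look like the following; the internal tensor name is made up, and you would first look up the real name of the cross-attention weights in Netron:

```python
import onnx
from onnx import TensorProto, helper

m = onnx.load("decoder_model.onnx")

# Expose an internal tensor as an extra graph output.
attn = helper.make_tensor_value_info(
    "cross_attn_weights_layer0",  # hypothetical name; find the real one in Netron
    TensorProto.FLOAT,
    None,  # leave the shape unspecified
)
m.graph.output.append(attn)

onnx.save(m, "decoder_model_with_attn.onnx")
```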
In summary, it appears that either a configuration flag is missing to force ONNX export to include cross_attentions, or the ONNX runtime implementation for generate in these image-to-text-with-past models currently does not support returning cross attentions. I’d recommend checking for any export options that preserve intermediate outputs and, if none exist, filing an issue with the maintainers.
Hope this helps, and thanks for sharing your experience!
@Alanturner2 Thanks for the insightful post. I was able to confirm via Netron that the ONNX decoder is only returning past key values and logits, not the attentions. Passing a simple model_kwargs={"output_attentions": True} in the export call doesn’t change anything. I’m assuming it’s not quite that simple and may require a custom config.
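For anyone following along, this is the programmatic equivalent of the Netron check (the file name and output names reflect my export and may differ for yours):

```python
import onnx

m = onnx.load("onnx_model/decoder_model.onnx")
print([o.name for o in m.graph.output])
# e.g. ['logits', 'present.0.decoder.key', 'present.0.decoder.value', ...]
# with no attention outputs listed
```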