When exporting seq2seq models with ONNX, why do we need both decoder_with_past_model.onnx and decoder_model.onnx?

Hi there, I’m the creator of Transformers.js, a JavaScript library which aims to run Hugging Face models directly in the browser. It relies on Optimum to convert PyTorch models to ONNX, which can then be used inside web browsers using onnxruntime-web. For the most part, everything works fine, but there appear to be a huge number of duplicated parameters between decoder_with_past_model.onnx and decoder_model.onnx for sequence-to-sequence and causal language models.

For causal language models, such as GPT2, I was able to avoid having to use decoder_model.onnx by creating an empty tensor for past_key_values and passing the inputs through decoder_with_past_model.onnx (see here).

I tried employing the same trick for sequence-to-sequence models (e.g., T5, Bart and Whisper), but ran into a problem since they also need past key values for the encoder (the cross-attention keys/values). Once again, I tried creating dummy inputs with a dimension of 0 where the concatenation would take place. However, this didn’t work either, as the output “present” key/value pairs were all empty tensors too.

I asked on the Discord channel and was directed to the forum to see if anyone could provide some assistance. Perhaps it would be possible to export smaller parts of the models to avoid this significant parameter duplication? That would dramatically improve performance (cutting total model size by up to 50%)!

I look forward to any responses! Thanks :hugs:

~ Xenova

Hi @Xenova, thank you for having a try at this!

I have the same experience as you for gpt2 (although I had to add position_ids as inputs to have matching logits due to this logic) using only decoder_with_past_model.onnx.

Unfortunately I did not have time to try with encoder-decoder models, but I was assuming it was possible. I could have a look shortly.

Alternatively, are you able to support ONNX models that have subgraphs? That’s the approach we are currently taking in Optimum; for reference: Validating ONNX model fails for GPT-J · Issue #607 · huggingface/optimum · GitHub. This is only available for decoder-only models for now, though; I plan to extend it to encoder-decoder architectures.

If you export a decoder-only model, you’ll see as output a merged decoder in charge of both cases (without past / with past): optimum-cli export onnx gpt2 gpt2_onnx/
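
On the JS side, a quick way to see what that merged export looks like is to load it and list its inputs (rough sketch assuming onnxruntime-web; the merged file name used below, decoder_model_merged.onnx, is the usual Optimum output but may differ depending on the version):

import * as ort from 'onnxruntime-web';

// Rough sketch: load the merged decoder produced by
// `optimum-cli export onnx gpt2 gpt2_onnx/` and inspect its inputs.
// The file name `decoder_model_merged.onnx` is an assumption here.
const session = await ort.InferenceSession.create('gpt2_onnx/decoder_model_merged.onnx');

// Alongside the usual decoder inputs and `past_key_values.*`, you should see a
// boolean `use_cache_branch` input that selects the with-past / without-past branch.
console.log(session.inputNames);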

I have the same experience as you for gpt2 (although I had to add position_ids as inputs to have matching logits due to this logic) using only decoder_with_past_model.onnx.

Are you still having the problem, even if you pass an empty tensor (of shape [batch_size, this.num_heads, 0, this.dim_kv])? I was able to bypass that check since, in that case, it isn’t None, and then the concatenation logic works.

For reference, here is the code to create the model (it only uses decoder_with_past_model.onnx):
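
In essence, the trick boils down to something like the following (a simplified sketch rather than the exact library code; this.num_layers is illustrative, and the exact feed names depend on the export):

// Simplified sketch: on the first step there is no cache yet, so feed
// zero-length past key values of shape [batch_size, num_heads, 0, dim_kv].
// ONNX Runtime then just concatenates onto the empty sequence dimension.
let dims = [1, this.num_heads, 0, this.dim_kv];
for (let i = 0; i < this.num_layers; ++i) {  // this.num_layers is illustrative
    decoderFeeds[`past_key_values.${i}.key`] = new Tensor('float32', [], dims);
    decoderFeeds[`past_key_values.${i}.value`] = new Tensor('float32', [], dims);
}

// Every call then goes through decoder_with_past_model.onnx, and the returned
// `present.*` tensors are fed back as `past_key_values.*` on the next step.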

and it works correctly (see the demo for proof).


Unfortunately I did not have time to try with encoder-decoder models, but I was assuming it was possible. I could have a look shortly.

That would be greatly appreciated! I’ve been banging my head against the wall for 2 days now; I keep getting strange outputs (sometimes the logits are NaN, other times the generated present_key_values are empty tensors).

Alternatively, are you able to support ONNX models that have subgraphs? That’s the approach we are currently taking in Optimum, for reference

I should say that I am not an expert in ONNX haha, so I’m not too sure what subgraphs are :eyes: For me, at least, ONNX is a magical black box: I provide the inputs and process the outputs.

If you export a decoder-only model, you’ll see as output a merged decoder in charge of both cases (without past / with past)

Okay great! Yes, this is something I desperately need for Transformers.js … the smaller the model is, the better!

Oh, I see, so that past_key_values[0][0].size(-2) is zero. I had not tried that; I did not expect ONNX Runtime to work with zero-sized inputs, but I’ll give it a try!

Subgraphs are basically a fancy way to handle control flow (if/else, loops) in ONNX. Depending on whether you need to use the k/v cache or not, you can dispatch the compute to one or the other of the two branches of an If node, which shares weights between both branches.

As a rule of thumb though, if you are able to avoid subgraphs (e.g. thanks to the hack with fake k/v), I would recommend avoiding them.

Haha yes, I was quite surprised myself when I got it working with gpt2.

As a rule of thumb though, if you are able to avoid subgraphs (e.g. thanks to the hack with fake k/v), I would recommend avoiding them.

Understood :+1:. Is there anything I can do to help get this working? I could continue trying myself to get the empty-tensor approach working for encoder-decoder models. (Although, I do think fixing this issue will either fix it, or at least get us closer to our goal! :slight_smile: )

If you think there are inputs/outputs that are ill-defined, you could have a look at optimum/optimum/exporters/onnx at main · huggingface/optimum · GitHub (in order of hierarchy: base.py, config.py, model_configs.py, as explained here), and hack and try from there, for example optimum/base.py at 913f9d5fe619acc852210f5978c62e1271185287 · huggingface/optimum · GitHub and optimum/config.py at 913f9d5fe619acc852210f5978c62e1271185287 · huggingface/optimum · GitHub.

I’ll have a look to see whether there’s an issue with the export itself (as seems to be the case for Whisper), or whether it’s purely an implementation problem.

HUGE update! I just got Whisper (seq2seq) working WITHOUT having to load both decoders, by using the merged decoder.

There were some hacks still necessary though, namely:

  1. Creating empty dummy past key values for the encoder and decoder when using the cache branch:
// Empty (zero sequence length) cross-attention ("encoder") key/values, one pair per layer
let encoder_heads = this.config.encoder_attention_heads;
let encoder_dims = [1, encoder_heads, 0, this.config.d_model / encoder_heads];
for (let i = 0; i < this.config.encoder_layers; ++i) {
    decoderFeeds[`past_key_values.${i}.encoder.key`] = new Tensor('float32', [], encoder_dims);
    decoderFeeds[`past_key_values.${i}.encoder.value`] = new Tensor('float32', [], encoder_dims);
}

// Empty self-attention ("decoder") key/values, one pair per layer
let decoder_heads = this.config.decoder_attention_heads;
let decoder_dims = [1, decoder_heads, 0, this.config.d_model / decoder_heads];
for (let i = 0; i < this.config.decoder_layers; ++i) {
    decoderFeeds[`past_key_values.${i}.decoder.key`] = new Tensor('float32', [], decoder_dims);
    decoderFeeds[`past_key_values.${i}.decoder.value`] = new Tensor('float32', [], decoder_dims);
}

  2. I don’t know if ONNX supports plain booleans, so I just created a tensor with a single boolean: use_cache_branch: new Tensor('bool', [pastKeyValues !== null], [1]). Perhaps the If logic just checks that all values are true, and so it works. In other words, I’m not questioning it too much haha (see the sketch after this list for how it fits into the generation loop).

  3. There were also a ton of warnings produced, but it works… so I don’t really care :wink:
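
For completeness, here is roughly how those pieces fit together in one generation step (a simplified sketch, not the exact code from models.js; addEmptyPastKeyValues is an illustrative helper wrapping the loops from point 1, and the present.* output names follow the Optimum export):

// Simplified sketch of a single step with the merged decoder.
// `session` is the onnxruntime session created from the merged decoder file.
async function decoderStep(session, decoderFeeds, pastKeyValues) {
    if (pastKeyValues === null) {
        // First step: no cache yet, so feed the zero-length dummies (point 1)
        // and let the If node take the "without past" branch.
        addEmptyPastKeyValues(decoderFeeds); // illustrative helper
    } else {
        // Later steps: feed the real cache and take the "with past" branch.
        Object.assign(decoderFeeds, pastKeyValues);
    }
    // Point 2: a single-element boolean tensor selects the branch.
    decoderFeeds['use_cache_branch'] = new Tensor('bool', [pastKeyValues !== null], [1]);

    const outputs = await session.run(decoderFeeds);

    // Re-use the returned `present.*` tensors as `past_key_values.*` next time.
    const newPastKeyValues = {};
    for (const name in outputs) {
        if (name.startsWith('present')) {
            newPastKeyValues[name.replace('present', 'past_key_values')] = outputs[name];
        }
    }
    return { logits: outputs.logits, pastKeyValues: newPastKeyValues };
}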


I will see if it is as simple to convert the other model types. :crossed_fingers:

I’m facing the exact same problem with the Whisper model. Could you please specify where you applied your hacks in the conversion?

Here’s a link to the source code: transformers.js/models.js at main · xenova/transformers.js · GitHub

Hopefully that will be enough for you to decipher! :smile:

@Xenova we are trying to make an app using a quantized fine-tuned Whisper model and run it in the browser using React. Our problem is that we don’t know how to generate the predicted output text from last_hidden_state and logits.

Hi @Noahloghman! It seems it couldn’t find the model. What is the model you were trying to load?

Hi @regisss, sorry for the late reply. It’s Llama 2 7B from Hugging Face.
Thank you

Hi guys, hi @xenova, I have this issue:
[E:onnxruntime:, inference_session.cc:1533 onnxruntime::InferenceSession::Initialize::<lambda_9a5ue43270b854edk3er320c0a5c4y9a>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\optimizer\initializer.cc:31 onnxruntime::Initializer::Initializer !model_path.IsEmpty() was false. model_path must not be empty. Ensure that a path is provided when the model is created or loaded.

I think this is still an issue.