Hi there, I’m the creator of Transformers.js, a JavaScript library that aims to run HuggingFace models directly in the browser. It relies on optimum to convert PyTorch models to ONNX, which can then be run inside web browsers with onnxruntime-web. For the most part, everything works fine, but there appears to be a ton of parameter duplication between decoder_model.onnx and decoder_with_past_model.onnx for sequence-to-sequence and causal language models.
For causal language models, such as GPT2, I was able to avoid having to use decoder_model.onnx by creating an empty tensor for past_key_values and passing the inputs through decoder_with_past_model.onnx (see here).
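Roughly, the trick looks like this (a simplified sketch using onnxruntime-web directly rather than the actual linked code; the past_key_values.{i}.key/value input names should match the Optimum GPT2 export, but check your model’s actual input list, and numLayers/numHeads/dimKv are placeholders read from the model config):

import * as ort from 'onnxruntime-web';

// Zero-length past: one [batch, num_heads, 0, dim_kv] tensor per key and value per layer,
// so the Concat nodes inside the graph are effectively no-ops on the first forward pass.
function addEmptyPastKeyValues(feeds, numLayers, numHeads, dimKv) {
    const dims = [1, numHeads, 0, dimKv];
    for (let i = 0; i < numLayers; ++i) {
        feeds[`past_key_values.${i}.key`] = new ort.Tensor('float32', new Float32Array(0), dims);
        feeds[`past_key_values.${i}.value`] = new ort.Tensor('float32', new Float32Array(0), dims);
    }
}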
I tried employing the same trick for sequence-to-sequence models (e.g., T5, Bart and Whisper), but ran into a problem since they need past key values for the encoder too. Once again, I tried just creating dummy inputs with a dimension of 0 where the concatenation would take place. However, this didn’t work either, as the output “present” key/value pairs were all empty tensors too.
I asked on the Discord channel, and I was directed to the forum to see if anyone could provide some assistance. Perhaps it would be possible to export smaller parts of the models to avoid significant parameter duplication? This would dramatically improve performance (cutting down model size by up to 50%)!
I have the same experience as you for gpt2 (although I had to add position_ids as inputs to have matching logits due to this logic) using only decoder_with_past_model.onnx.
Unfortunately, I did not have time to try with encoder-decoder models; I was assuming it was possible. I could have a look shortly.
Alternatively, are you able to support ONNX models that have subgraphs? That’s the approach we are currently taking in Optimum, for reference: Validating ONNX model fails for GPT-J · Issue #607 · huggingface/optimum · GitHub. This is only available for decoder-only models for now though; I plan to extend it to encoder-decoder architectures.
If you export a decoder-only model, you’ll see as output a merged decoder that handles both cases (without past / with past): optimum-cli export onnx gpt2 gpt2_onnx/
I have the same experience as you for gpt2 (although I had to add position_ids as inputs to have matching logits due to this logic) using only decoder_with_past_model.onnx.
Are you still having the problem, even if you pass an empty tensor (of shape [batch_size, this.num_heads, 0, this.dim_kv])? I was able to bypass that check since, in that case, it isn’t None, and then the concatenation logic works.
For reference, here is the code to create the model (it only uses decoder_with_past_model.onnx):
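In simplified form (not the full implementation), it boils down to creating a single onnxruntime-web session for the with-past graph; the path and session options below are just an example:

import * as ort from 'onnxruntime-web';

// Only the decoder_with_past graph is loaded; decoder_model.onnx is never downloaded.
const session = await ort.InferenceSession.create(
    'gpt2/decoder_with_past_model.onnx',   // example path/URL
    { executionProviders: ['wasm'] }
);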
Unfortunately, I did not have time to try with encoder-decoder models; I was assuming it was possible. I could have a look shortly.
That would be greatly appreciated! I’ve been banging my head against the wall for 2 days now, and I keep getting strange outputs (sometimes the logits are NaN, other times the generated present_key_values are empty tensors).
Alternatively, are you able to support ONNX models that have subgraphs? That’s the approach we are currently taking in Optimum, for reference
I should say that I am not an expert in ONNX haha, so I’m not too sure what subgraphs are. For me, at least, I treat ONNX as a magical black box: I provide the inputs and process the outputs.
If you export a decoder-only model, you’ll see as output a merged decoder that handles both cases (without past / with past)
Okay great! Yes, this is something I desperately need for Transformers.js … the smaller the model is, the better!
Oh, I see, so past_key_values[0][0].size(-2) is zero. I had not tried that - I did not expect ONNX Runtime to work with zero-sized inputs, but I’ll give it a try!
Subgraphs are basically a fancy way to handle control flow (if/else, loops) in ONNX. Depending on whether you need to use the k/v cache or not, you can dispatch the computation to one or the other of the two branches of an If node, which shares the weights between both.
Haha yes, I was quite surprised myself when I got it working with gpt2.
As a rule of thumb though, if you are able to avoid subgraphs (e.g. thanks to the hack with fake k/v), I would recommend avoiding them.
Understood. Is there anything I can do to help get this working? I could keep trying to get the empty-tensor approach working for encoder-decoder models myself. (Although I do think fixing that issue will either solve this, or at least get us closer to our goal!)
HUGE update! I just got whisper seq2seq working WITHOUT having to load both decoders. This is done using the merged decoder.
Some hacks were still necessary though, namely:
Creating empty dummy past key values for the encoder and decoder when using use_cache_branch:
// Empty cross-attention (encoder) key/values: [batch, heads, 0, dim_per_head]
let encoder_heads = this.config.encoder_attention_heads;
let encoder_dims = [1, encoder_heads, 0, this.config.d_model / encoder_heads];
for (let i = 0; i < this.config.encoder_layers; ++i) {
    decoderFeeds[`past_key_values.${i}.encoder.key`] = new Tensor('float32', [], encoder_dims);
    decoderFeeds[`past_key_values.${i}.encoder.value`] = new Tensor('float32', [], encoder_dims);
}

// Empty self-attention (decoder) key/values, same idea
let decoder_heads = this.config.decoder_attention_heads;
let decoder_dims = [1, decoder_heads, 0, this.config.d_model / decoder_heads];
for (let i = 0; i < this.config.decoder_layers; ++i) {
    decoderFeeds[`past_key_values.${i}.decoder.key`] = new Tensor('float32', [], decoder_dims);
    decoderFeeds[`past_key_values.${i}.decoder.value`] = new Tensor('float32', [], decoder_dims);
}
I don’t know if ONNX supports plain booleans, so I just created a tensor with a single boolean: use_cache_branch: new Tensor('bool', [pastKeyValues !== null], [1]). Perhaps the If logic checks for all values to be true, and so it works. In other words, I’m not questioning it too much haha
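Putting the two hacks together, a single decoding step with the merged decoder ends up looking roughly like this (just a sketch: session, input_ids, encoder_hidden_states and a hypothetical addEmptyEncoderDecoderPast helper wrapping the two loops above are assumed, Tensor is the same wrapper as in the snippets above, and the exact input list depends on the model):

const useCache = pastKeyValues !== null;

const decoderFeeds = {
    input_ids,
    encoder_hidden_states,
    // The If node takes the "with past" branch when this is true
    use_cache_branch: new Tensor('bool', [useCache], [1]),
};

if (useCache) {
    // Reuse the previous step's "present" outputs (already renamed to past_key_values.* keys)
    Object.assign(decoderFeeds, pastKeyValues);
} else {
    // First step: fill in the zero-length encoder/decoder dummies shown above
    addEmptyEncoderDecoderPast(decoderFeeds);
}

const outputs = await session.run(decoderFeeds);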
There were also a ton of warnings produced, but it works… so I don’t really care.
@Xenova we are trying to make an app using a quantized fine-tuned Whisper model and run it in the browser using React. Our problem is that we don’t know how to generate the predicted output text from last_hidden_state and logits.
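In other words, is a simple greedy argmax over the last position of the logits, roughly like the sketch below (all variable names assumed), the right way to turn them into token ids to pass to the tokenizer’s decode?

// Rough sketch (names assumed): pick the most likely next token from the decoder logits.
// logits is a flat Float32Array with shape [batch=1, seq_len, vocab_size];
// we look at the last position only and take the argmax.
function greedyNextToken(logits, seqLen, vocabSize) {
    const offset = (seqLen - 1) * vocabSize;
    let best = 0;
    for (let v = 1; v < vocabSize; ++v) {
        if (logits[offset + v] > logits[offset + best]) best = v;
    }
    return best; // append to the decoder input_ids, repeat, then detokenize at the end
}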
Hi guys, hi @xenova, I have this issue:
[E:onnxruntime:, inference_session.cc:1533 onnxruntime::InferenceSession::Initialize::<lambda_9a5ue43270b854edk3er320c0a5c4y9a>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\optimizer\initializer.cc:31 onnxruntime::Initializer::Initializer !model_path.IsEmpty() was false. model_path must not be empty. Ensure that a path is provided when the model is created or loaded.