Hi there, I’m the creator of Transformers.js, a JavaScript library that aims to run Hugging Face models directly in the browser. It relies on optimum to convert PyTorch models to ONNX, which can then be run in web browsers with onnxruntime-web. For the most part, everything works well, but there appears to be a huge amount of parameter duplication between decoder_model.onnx and decoder_with_past_model.onnx for sequence-to-sequence and causal language models.
For causal language models such as GPT-2, I was able to avoid using decoder_model.onnx at all by creating empty tensors for past_key_values and running the inputs through decoder_with_past_model.onnx (see here).
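To make the trick concrete, here’s a minimal sketch using onnxruntime-web. The input/output names (`input_ids`, `attention_mask`, `past_key_values.{i}.key/value`, `present.{i}.key/value`) follow optimum’s GPT-2 export, and the layer/head sizes are GPT-2 base’s, shown purely for illustration:

```js
import * as ort from 'onnxruntime-web';

// Illustrative sizes (GPT-2 base): 12 layers, 12 heads, head dim 64.
const numLayers = 12, numHeads = 12, headDim = 64;

const session = await ort.InferenceSession.create('decoder_with_past_model.onnx');

// Token ids for a short prompt (int64 in the ONNX graph).
const inputIds = new ort.Tensor('int64', new BigInt64Array([15496n, 995n]), [1, 2]);
const attentionMask = new ort.Tensor('int64', new BigInt64Array([1n, 1n]), [1, 2]);

const feeds = { input_ids: inputIds, attention_mask: attentionMask };

// The trick: past key/values with sequence length 0, so the first forward
// pass concatenates nothing and behaves just like decoder_model.onnx.
const empty = () => new ort.Tensor('float32', new Float32Array(0), [1, numHeads, 0, headDim]);
for (let i = 0; i < numLayers; ++i) {
  feeds[`past_key_values.${i}.key`] = empty();
  feeds[`past_key_values.${i}.value`] = empty();
}

const { logits, ...presents } = await session.run(feeds);
// `presents` holds the present.{i}.key/value tensors, which get fed back
// on the next step as past_key_values.{i}.key/value.
```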
I tried employing the same trick for sequence-to-sequence models (e.g., T5, BART, and Whisper), but ran into a problem: they need past key values for the encoder (cross-attention) too. Once again, I tried creating dummy inputs with a dimension of 0 along the axis where concatenation would take place. However, this didn’t work either: the output “present” key/value pairs were all empty tensors as well, presumably because the cross-attention key/values are computed once from the encoder output and only reused afterwards, rather than being concatenated onto.
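For reference, this is roughly what the failed attempt looks like. The input names (`past_key_values.{i}.decoder.key/value` for self-attention, `past_key_values.{i}.encoder.key/value` for cross-attention, plus `encoder_attention_mask`) follow my understanding of optimum’s seq2seq export, and all sizes here are illustrative, not any real model’s config:

```js
import * as ort from 'onnxruntime-web';

// Illustrative sizes only.
const numLayers = 6, numHeads = 8, headDim = 64;

const session = await ort.InferenceSession.create('decoder_with_past_model.onnx');

const feeds = {
  // Decoder start token id (illustrative) and a mask over 3 encoder positions.
  input_ids: new ort.Tensor('int64', new BigInt64Array([0n]), [1, 1]),
  encoder_attention_mask: new ort.Tensor('int64', new BigInt64Array([1n, 1n, 1n]), [1, 3]),
};

const empty = () => new ort.Tensor('float32', new Float32Array(0), [1, numHeads, 0, headDim]);
for (let i = 0; i < numLayers; ++i) {
  // Self-attention cache: a length-0 past works, since new key/values
  // are concatenated onto it at every step.
  feeds[`past_key_values.${i}.decoder.key`] = empty();
  feeds[`past_key_values.${i}.decoder.value`] = empty();
  // Cross-attention cache: a length-0 past does NOT work — these tensors
  // are normally computed once from the encoder output and merely reused,
  // so the graph passes the empty dummies straight through...
  feeds[`past_key_values.${i}.encoder.key`] = empty();
  feeds[`past_key_values.${i}.encoder.value`] = empty();
}

const outputs = await session.run(feeds);
// ...and the returned present.{i}.encoder.key/value tensors are empty too.
```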
I asked on the Discord channel and was directed to the forum to see if anyone could help. Would it be possible to export smaller parts of the models to avoid this significant parameter duplication? That would be a dramatic improvement, cutting total model size by up to 50%!
I look forward to any responses! Thanks
~ Xenova