@ahmedbr Could you file a bug report with a reproduction script on Issues · huggingface/optimum · GitHub so that I can have a look at it?
@luckyt The main reason is that you normally want to run the encoder only once, while you'd like to loop over the decoder when generating. You could ask: why not wrap everything into a single ONNX model, with, say, an `If` node to decide whether or not to run the encoder? Something like this with subgraphs:
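Here is a minimal, self-contained sketch of that idea (not Optimum's actual export code): a toy graph whose `If` node either runs a stand-in "encoder" branch or passes the hidden states through untouched. All the names (`run_encoder`, `hidden`, `then_out`, etc.) are made up for illustration.

```python
import numpy as np
import onnx
import onnxruntime as ort
from onnx import TensorProto, helper

# Outer-graph inputs: a boolean switch plus the tensor both branches share.
cond = helper.make_tensor_value_info("run_encoder", TensorProto.BOOL, [])
hidden = helper.make_tensor_value_info("hidden", TensorProto.FLOAT, [1, 4])

# then-branch: a stand-in "encoder" that just doubles the hidden states.
two = helper.make_tensor("two", TensorProto.FLOAT, [], [2.0])
then_graph = helper.make_graph(
    [helper.make_node("Mul", ["hidden", "two"], ["then_out"])],
    "then_branch",
    [],  # subgraphs capture `hidden` from the outer scope
    [helper.make_tensor_value_info("then_out", TensorProto.FLOAT, [1, 4])],
    initializer=[two],
)

# else-branch: skip the "encoder" and pass the hidden states through.
else_graph = helper.make_graph(
    [helper.make_node("Identity", ["hidden"], ["else_out"])],
    "else_branch",
    [],
    [helper.make_tensor_value_info("else_out", TensorProto.FLOAT, [1, 4])],
)

# The If node picks one of the two subgraphs at runtime.
if_node = helper.make_node(
    "If", ["run_encoder"], ["out"],
    then_branch=then_graph, else_branch=else_graph,
)
out = helper.make_tensor_value_info("out", TensorProto.FLOAT, [1, 4])
graph = helper.make_graph([if_node], "encoder_switch", [cond, hidden], [out])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)

sess = ort.InferenceSession(model.SerializeToString(),
                            providers=["CPUExecutionProvider"])
x = np.ones((1, 4), dtype=np.float32)
print(sess.run(None, {"run_encoder": np.array(True), "hidden": x}))   # doubled
print(sess.run(None, {"run_encoder": np.array(False), "hidden": x}))  # unchanged
```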
This could be doable actually. The issue is that usability becomes harder, as the encoder and decoder do not have the same inputs/outputs. So you would need to create fake inputs/outputs, which theoretically works, but may lead to errors and be a bit unintuitive.
About generation, what is slightly challenging is that inputs/outputs are fixed with ONNX, and more importantly, when exporting we use torch.jit.trace, which cannot handle control flow. Control flow is typically what handles the without/with past (use KV cache or not) case: in the first step of generation you don't use the KV cache, while in later steps you do. See transformers/src/transformers/models/t5/modeling_t5.py at v4.30.2 · huggingface/transformers · GitHub & How does the ONNX exporter work for GenerationModel with `past_key_value`?
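To make the tracing limitation concrete, here is a toy sketch (my own illustration, where `past` plays the role of the KV cache): torch.jit.trace records the ops for one concrete input, so whichever branch runs during tracing gets baked in permanently.

```python
import torch

class ToyDecoder(torch.nn.Module):
    # Toy stand-in for a decoder; `past` plays the role of the KV cache.
    def forward(self, x, past=None):
        if past is None:                # Python branch: invisible to trace
            past = torch.zeros_like(x)  # "first generation step" path
        return x + past                 # "later steps" path reuses `past`

x = torch.ones(2, 3)
model = ToyDecoder()

# Trace once without past: the `past is None` branch is frozen in.
no_past = torch.jit.trace(model, (x,))
# Trace again with past: now only the with-past branch exists.
with_past = torch.jit.trace(model, (x, torch.full((2, 3), 5.0)))

print(no_past(x))                       # always the no-cache path
print(with_past(x, torch.zeros(2, 3)))  # always the cache path
```

Neither traced graph can switch branches at runtime, which is one reason the export typically produces separate graphs for the first generation step and for the subsequent with-past steps.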