Optimum & T5 for inference

Hello @echarlaix,

First, thanks a lot for the amazing work. I saw your draft PR (Add seq2seq ort inference by echarlaix · Pull Request #199 · huggingface/optimum · GitHub) and I was so excited to improve the speed of my models that I tried it.

I got the same problem as above, saying that T5 models are unsupported:

 File "/home/pierre/projects/openbook-models/.venv/lib/python3.9/site-packages/optimum/onnxruntime/utils.py", line 106, in check_supported_model_or_raise
    raise KeyError(
KeyError: "t5 model type is not supported yet.

And here is my testing code:


from pathlib import Path
from typing import Any

from transformers import AutoTokenizer, Text2TextGenerationPipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig


class MultipleText2TextGenerationPipeline(Text2TextGenerationPipeline):
    # Return multiple outputs per input (otherwise transformers hardcodes keeping only the first answer)
    def __call__(self, *args: list[Any], **kwargs: Any):
        # Call Pipeline.__call__ directly to skip Text2TextGenerationPipeline's single-output postprocessing
        results: list[list[dict[str, str]]] = super(Text2TextGenerationPipeline, self).__call__(*args, **kwargs)
        flatten_results: list[str] = []
        for result_list in results:
            for result_dict in result_list:
                flatten_results.append(result_dict["generated_text"].replace("question: ", ""))
        return flatten_results

class MySuperT5Model:
    def __init__(self, weights_cache_folder: str = "weights_cache"):
        # Assumed default; the original snippet did not show where weights_cache_folder comes from
        self.weights_cache_folder = Path(weights_cache_folder)
        self.weights_cache_folder.mkdir(exist_ok=True)

    def prepare(self):
        model_id = "mrm8488/t5-base-finetuned-question-generation-ap"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        save_path = Path(self.weights_cache_folder, "optimum_model")
        save_path.mkdir(exist_ok=True)

        optimizer = ORTOptimizer.from_pretrained(model_id, feature="seq2seq-lm")
        opt_config = OptimizationConfig(optimization_level=99, optimize_for_gpu=True)

        optimizer.export(
            onnx_model_path=save_path / "model.onnx",
            onnx_optimized_model_output_path=save_path / "model-optimized.onnx",
            optimization_config=opt_config,
        )

        optimizer.model.config.save_pretrained(save_path)
        model = ORTModelForSeq2SeqLM.from_pretrained(save_path, file_name="model-optimized.onnx")
        self.onnx_clx = MultipleText2TextGenerationPipeline(model=model, tokenizer=tokenizer, device=0)

    def __call__(self, input_texts: list[str]) -> list[list[str]]:
        # DEFAULT_GENERATOR_OPTIONS has batch_size=8 and num_return_sequences=3, plus all the rest (see the sketch right below)
        output_texts: list[str] = self.onnx_clx(input_texts, **DEFAULT_GENERATOR_OPTIONS)
        # some_batching_logic... generated_questions=.....
        return generated_questions
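
For context, DEFAULT_GENERATOR_OPTIONS and the call site look roughly like this. Apart from batch_size=8 and num_return_sequences=3 (the values mentioned in the comment above), the options here are just illustrative placeholders, and prepare() is the call that currently fails with the KeyError:

DEFAULT_GENERATOR_OPTIONS = {
    "batch_size": 8,            # real value from the comment above
    "num_return_sequences": 3,  # real value from the comment above
    "num_beams": 3,             # placeholder, just needs to be >= num_return_sequences
    "max_length": 64,           # placeholder
}

t5_model = MySuperT5Model()
t5_model.prepare()  # currently raises KeyError: "t5 model type is not supported yet."
# Calling the pipeline directly here, since the batching logic in __call__ is elided above
questions = t5_model.onnx_clx(
    ["answer: ONNX Runtime  context: Optimum relies on ONNX Runtime to speed up inference."],
    **DEFAULT_GENERATOR_OPTIONS,
)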

I would be willing to spend some time integrating T5, as it's really important for me to have this model as lightweight and fast as possible.
It should be feasible, as T5 is already listed in Export 🤗 Transformers Models.
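
For instance, the plain transformers.onnx export already knows about T5. Here is a quick sketch (assuming the Python export API; it produces a single graph, which is not necessarily the layout ORTModelForSeq2SeqLM expects, but it shows the T5 ONNX config is there):

from pathlib import Path

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.onnx import export
from transformers.onnx.features import FeaturesManager

model_id = "mrm8488/t5-base-finetuned-question-generation-ap"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Look up the ONNX config registered for T5 with the seq2seq-lm feature
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(
    model, feature="seq2seq-lm"
)
onnx_config = model_onnx_config(model.config)

# Export to ONNX with the default opset for this config
onnx_inputs, onnx_outputs = export(
    tokenizer, model, onnx_config, onnx_config.default_onnx_opset, Path("t5-qg.onnx")
)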

I know it's only a draft and I'm really sorry that I used an early work PR :confused: However, it works great, bravo! If I can be of any help please let me know.

Thanks in advance,
Have a great day.