Optimum & T5 for inference


Last week I saw the following announcement about Optimum v1.1:

We released 🤗 Optimum v1.1 this week to accelerate Transformers with new [ONNX Runtime](https://www.linkedin.com/company/onnxruntime/) tools:

🏎 Train models up to 30% faster (for models like T5) with ORTTrainer!
DeepSpeed is natively supported out of the box. 😍

🏎 Accelerate inference using static and dynamic quantization with ORTQuantizer!
Get >=99% accuracy of the original FP32 model with speed up up to 3x and size reduction up to 4x
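
The "size reduction up to 4x" figure follows from storage arithmetic: an FP32 weight takes 4 bytes and an INT8 weight takes 1. The sketch below shows symmetric 8-bit quantization using only the standard library. The weights are made up and `quantize_int8` is a hypothetical helper; this is an illustration of the idea, not how ORTQuantizer is implemented.

```python
from array import array

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to the range [-127, 127]."""
    # Fall back to a scale of 1.0 if every weight is zero
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = array("b", (round(w / scale) for w in weights))
    return q, scale

# Made-up weights, for illustration only
fp32_weights = array("f", [0.12, -0.53, 0.98, -1.27, 0.0, 0.41])
int8_weights, scale = quantize_int8(fp32_weights)

# Dequantize to see how much precision was lost
restored = [q * scale for q in int8_weights]
max_error = max(abs(w - r) for w, r in zip(fp32_weights, restored))

# Compare the storage needed by the two representations
size_ratio = (fp32_weights.itemsize * len(fp32_weights)) / (
    int8_weights.itemsize * len(int8_weights)
)
print(f"size ratio: {size_ratio:.0f}x, max rounding error: {max_error:.4f}")
```

The rounding error per weight is bounded by half the scale, which is why accuracy can stay close to the FP32 model while the stored weights shrink.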


In order to test it with a T5 model for inference, I went to the Optimum GitHub repository, copied the quantization code into a Colab notebook, and set model_checkpoint and feature as follows:

!python -m pip install optimum[onnxruntime]

!pip install sentencepiece

model_checkpoint = "mrm8488/t5-base-finetuned-question-generation-ap"
feature = "text2text-generation"

from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer
from functools import partial
from datasets import Dataset

# Tokenize the inputs
def preprocess_fn(ex, tokenizer):
  return tokenizer(ex["sentence"])

# Create a dataset or load one from the Hub
ds = Dataset.from_dict({"sentence": ["answer: Manuel context: Manuel has created RuPERTa-base with the support of HF-Transformers and Google"]})

# The type of quantization to apply
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature=feature)

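# Quantize the model!
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)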

I then ran the code, but it raised an error:

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


KeyError                                  Traceback (most recent call last)

<ipython-input-9-56024da8abbc> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', '# The type of quantization to apply\nqconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)\nquantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature=feature)\n\n# Quantize the model!\nquantizer.export(\n    onnx_model_path="model.onnx",\n    onnx_quantized_model_output_path="model-quantized.onnx",\n    quantization_config=qconfig,\n)')

4 frames

<decorator-gen-53> in time(self, line, cell, local_ns)

<timed exec> in <module>()

/usr/local/lib/python3.7/dist-packages/transformers/onnx/features.py in get_model_class_for_feature(feature, framework)
    362         if task not in task_to_automodel:
    363             raise KeyError(
--> 364                 f"Unknown task: {feature}. "
    365                 f"Possible values are {list(FeaturesManager._TASKS_TO_AUTOMODELS.values())}"
    366             )

KeyError: "Unknown task: text-generation. Possible values are
 [<class 'transformers.models.auto.modeling_auto.AutoModel'>, 
<class 'transformers.models.auto.modeling_auto.AutoModelForMaskedLM'>, 
<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, 
<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, 
<class 'transformers.models.auto.modeling_auto.AutoModelForSequenceClassification'>, 
<class 'transformers.models.auto.modeling_auto.AutoModelForTokenClassification'>, 
<class 'transformers.models.auto.modeling_auto.AutoModelForMultipleChoice'>, 
<class 'transformers.models.auto.modeling_auto.AutoModelForQuestionAnswering'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForImageClassification'>]"

Then, I went to the Optimum documentation, but it does not look to be up to date and I did not find a solution.

Does it mean that Optimum v1.1 cannot be used for T5 inference?

Hi @pierreguillou

Optimum currently does not support ONNX Runtime inference for T5 models (or any other encoder-decoder models).

We are however planning to integrate this feature in the near future.

Also, the error you are seeing does not come from inference but from the feature you chose when exporting the model to the ONNX format.

You can try:
feature = "seq2seq-lm"
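
The KeyError in the traceback comes from looking up the feature name in a task-to-model-class table. Below is a simplified, standalone sketch of that lookup, not the actual transformers code. The table keys are assumed to be the task names matching the nine classes listed in the error message, and `resolve_feature` is a hypothetical helper.

```python
# Simplified stand-in for the task table used by the ONNX export code.
# Keys are assumed task names; values are the class names from the error message.
TASKS_TO_AUTOMODELS = {
    "default": "AutoModel",
    "masked-lm": "AutoModelForMaskedLM",
    "causal-lm": "AutoModelForCausalLM",
    "seq2seq-lm": "AutoModelForSeq2SeqLM",
    "sequence-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "multiple-choice": "AutoModelForMultipleChoice",
    "question-answering": "AutoModelForQuestionAnswering",
    "image-classification": "AutoModelForImageClassification",
}

def resolve_feature(feature: str) -> str:
    """Return the model class name for a feature, or raise KeyError if it is unknown."""
    if feature not in TASKS_TO_AUTOMODELS:
        raise KeyError(
            f"Unknown task: {feature}. Possible values are {list(TASKS_TO_AUTOMODELS)}"
        )
    return TASKS_TO_AUTOMODELS[feature]

print(resolve_feature("seq2seq-lm"))
try:
    resolve_feature("text2text-generation")
except KeyError as err:
    print("rejected:", err)
```

Pipeline task names such as text2text-generation are not keys in this kind of table, which would explain why the export rejects them while "seq2seq-lm" is accepted.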


Thank you @echarlaix for your answer.

feature = "seq2seq-lm" lets the code from my post run but, as you said, the resulting ONNX model cannot then be used for inference.

For example, the following code fails:

from optimum.onnxruntime import ORTModel

# Load quantized model
ort_model = ORTModel("model-quantized.onnx", quantizer._onnx_config)

tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=quantizer.tokenizer))

# the code fails at this line
ort_outputs = ort_model.evaluation_loop(tokenized_ds)


And what about Intel Neural Compressor, as cited in Optimizing models towards inference? Does it work for T5 inference?

Another question: where can I find the list of features such as seq2seq-lm? Thanks.

Yes exactly, it is not yet possible to use ORTModel to perform ONNX Runtime inference for such models.

Unfortunately, Intel Neural Compressor quantization and pruning support has not been integrated for the text generation examples, but it is something that we plan to work on in the coming months.

You can find the features to export models for different types of topologies or tasks here
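
The features can also be listed programmatically through `FeaturesManager` in `transformers.onnx`, the module that appears in the traceback above. This sketch assumes a transformers release that still ships that module. `list_onnx_features` is a hypothetical helper, and it returns an empty list when the module cannot be loaded.

```python
def list_onnx_features(model_type: str) -> list[str]:
    """List the ONNX export features registered for a model type such as "t5"."""
    try:
        from transformers.onnx import FeaturesManager
    except (ImportError, RuntimeError):
        # transformers is missing, its lazy import failed, or this release
        # no longer ships transformers.onnx
        return []
    features = FeaturesManager.get_supported_features_for_model_type(model_type)
    # The result is a dict keyed by feature name; return the names sorted
    return sorted(features)

print(list_onnx_features("t5"))
```

For T5, the returned names should include "seq2seq-lm" on releases where the module is available.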


Hello @echarlaix,

First, thanks a lot for the amazing work. I saw your draft PR (Add seq2seq ort inference by echarlaix · Pull Request #199 · huggingface/optimum · GitHub) and I was so excited to improve the speed of my models that I tried it.

I got the same problem as above, with an error saying that T5 models are unsupported:

 File "/home/pierre/projects/openbook-models/.venv/lib/python3.9/site-packages/optimum/onnxruntime/utils.py", line 106, in check_supported_model_or_raise
    raise KeyError(
KeyError: "t5 model type is not supported yet.

And here is my test code:

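from pathlib import Path
from typing import Any

from transformers import AutoTokenizer, Text2TextGenerationPipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Text2TextPipelineOutput, DEFAULT_GENERATOR_OPTIONS and weights_cache_folder are not shown in this post.

class MultipleText2TextGenerationPipeline(Text2TextGenerationPipeline):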
    # This is to be able to return multiple outputs per input (else transformers hardcode it to get the first answer)
    def __call__(self, *args: list[Any], **kwargs: Any):
        result: Text2TextPipelineOutput = super(Text2TextGenerationPipeline, self).__call__(*args, **kwargs)
        flatten_results: list[str] = []
        for result_list in result:
            for result_dict in result_list:
                flatten_results.append(result_dict["generated_text"].replace("question: ", ""))
        return flatten_results

class MySuperT5Model:
    def prepare(self):
        model_id = "mrm8488/t5-base-finetuned-question-generation-ap"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        save_path = Path(self.weights_cache_folder, "optimum_model")

        optimizer = ORTOptimizer.from_pretrained(model_id, feature="seq2seq-lm")
        opt_config = OptimizationConfig(optimization_level=99, optimize_for_gpu=True)

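        # Export the model to ONNX and write the optimized graph next to it
        optimizer.export(
            onnx_model_path=save_path / "model.onnx",
            onnx_optimized_model_output_path=save_path / "model-optimized.onnx",
            optimization_config=opt_config,
        )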

        model = ORTModelForSeq2SeqLM.from_pretrained(save_path, file_name="model-optimized.onnx")
        self.onnx_clx = MultipleText2TextGenerationPipeline(model=model, tokenizer=tokenizer, device=0)

    def __call__(self, input_texts: list[str]) -> list[list[str]]:
        # default_generator have batch_size=8 and num_return_sequence=3, plus all the rest
        output_texts: list[str] = self.onnx_clx(input_texts, **DEFAULT_GENERATOR_OPTIONS)
        # some_batching_logic... generated_questions=.....
        return generated_questions

I would be willing to spend some time integrating T5, as it's really important for me to have this model as lightweight and fast as possible.
It should be feasible, as T5 is listed in Export 🤗 Transformers Models.

I know it's only a draft and I'm really sorry that I used an early work PR :confused: However, it works great, bravo! If I can be of any help, please let me know.

Thanks in advance,
Have a great day.

For posterity, and for anyone trying to solve the same problem:

After a bit of research, the only missing piece seems to be an onnxruntime/transformers/onnx_model_t5.py, which is what optimum would instantiate. So I guess the problem is not on the optimum side but on the onnxruntime side.

I'm using Bart for sentence infilling and was also wondering about Optimum. I'm assuming it isn't currently supported, but is there an ETA (or somewhere to keep track of progress) on this kind of conditional generation with seq2seq?


Hi @ierezell and @jbmaxwell,

Thanks for the feedback @ierezell, we are very pleased to hear that you were able to easily use our ORTModelForSeq2SeqLM class. You are right, this should be added on the ONNX Runtime side.

Concerning seq2seq model inference @jbmaxwell, you can see the progress here, this PR should be merged this week or the following.


Hi @echarlaix,

Thanks for the reply :slight_smile:

I opened an issue here if you want to track the status/progress of onnx/onnxruntime for T5 models.

Have a great day.


Thanks a lot for taking care of this @ierezell, that will be a great addition :hugs:


Hi @pierreguillou, @ierezell and @jbmaxwell,

The PR for inference of Seq2Seq models is merged.


Many thanks @echarlaix !

One question: until now, one useful solution in order to use T5 with ONNX Runtime was the library fastT5 (see Search results for 'fastT5' - Hugging Face Forums).

If I understood correctly, Optimum can now be used with T5, which means that Optimum allows us to do everything (and more?) on T5 that fastT5 allowed. Yes?

Currently, ORTModelForSeq2SeqLM supports inference for different types of architectures (such as T5, but also Bart, MBart, M2M100 and others). We are also refactoring our ORTOptimizer / ORTQuantizer classes to make it easy to optimize and dynamically quantize those models.


Hello, first of all, I'd like to thank the entire team responsible for the development of Optimum. It has relieved the most important problem related to transformers in production. However, can you please share a code snippet showing how to use ORTModelForSeq2SeqLM for models like BART and T5?


Hi @nid989! You can find usage examples in our documentation :hugs:
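
As a starting point, here is a minimal sketch of T5 inference with ORTModelForSeq2SeqLM, reusing the checkpoint from the first post. It assumes a recent Optimum release in which `export=True` converts the checkpoint to ONNX on load (earlier releases used `from_transformers=True`). `generate_question` is a hypothetical helper name, and the imports sit inside the function so that defining it requires neither package.

```python
def generate_question(
    text: str,
    model_id: str = "mrm8488/t5-base-finetuned-question-generation-ap",
) -> str:
    """Export a seq2seq checkpoint to ONNX and generate text with ONNX Runtime."""
    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Assumption: export=True is the flag name in the installed Optimum version
    model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)

    # Tokenize, generate with the ONNX Runtime backed model, then decode
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling `generate_question("answer: Manuel context: Manuel has created RuPERTa-base with the support of HF-Transformers and Google")` downloads and exports the checkpoint on first use, so it needs network access and both libraries installed. The same pattern should apply to BART checkpoints by changing `model_id`.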


@echarlaix On the GPU, when the sentence is long, ONNX model inference becomes slow. Can this be solved? Also, the quantized ONNX model of M2M100 is slower on the GPU. Is that because AutoQuantizationConfig.avx512_vnni only supports CPU?