Optimum & T5 for inference


Last week I saw the following announcement about Optimum v1.1:

We released πŸ€— Optimum v1.1 this week to accelerate Transformers with new [ONNX Runtime](https://www.linkedin.com/company/onnxruntime/) tools:

🏎 Train models up to 30% faster (for models like T5) with ORTTrainer!
DeepSpeed is natively supported out of the box. 😍

🏎 Accelerate inference using static and dynamic quantization with ORTQuantizer!
Get >=99% of the original FP32 model's accuracy, with up to 3x speedup and up to 4x size reduction


To test it with a T5 model for inference, I went to the Optimum GitHub repo, copied the quantization example code into a Colab notebook, and set `model_checkpoint` and `feature` as follows:

```shell
!python -m pip install optimum[onnxruntime]
!pip install sentencepiece
```

```python
model_checkpoint = "mrm8488/t5-base-finetuned-question-generation-ap"
feature = "text2text-generation"

from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer
from functools import partial
from datasets import Dataset

# Tokenize the inputs
def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["sentence"])

# Create a dataset or load one from the Hub
ds = Dataset.from_dict({"sentence": ["answer: Manuel context: Manuel has created RuPERTa-base with the support of HF-Transformers and Google"]})

# The type of quantization to apply
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature=feature)

# Quantize the model!
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)
```
Then, I ran the code but it gave an error:

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


```
KeyError                                  Traceback (most recent call last)

<ipython-input-9-56024da8abbc> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', '# The type of quantization to apply\nqconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)\nquantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature=feature)\n\n# Quantize the model!\nquantizer.export(\n    onnx_model_path="model.onnx",\n    onnx_quantized_model_output_path="model-quantized.onnx",\n    quantization_config=qconfig,\n)')

4 frames

<decorator-gen-53> in time(self, line, cell, local_ns)

<timed exec> in <module>()

/usr/local/lib/python3.7/dist-packages/transformers/onnx/features.py in get_model_class_for_feature(feature, framework)
    362         if task not in task_to_automodel:
    363             raise KeyError(
--> 364                 f"Unknown task: {feature}. "
    365                 f"Possible values are {list(FeaturesManager._TASKS_TO_AUTOMODELS.values())}"
    366             )

KeyError: "Unknown task: text-generation. Possible values are
 [<class 'transformers.models.auto.modeling_auto.AutoModel'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForMaskedLM'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForSequenceClassification'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForTokenClassification'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForMultipleChoice'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForQuestionAnswering'>,
 <class 'transformers.models.auto.modeling_auto.AutoModelForImageClassification'>]"
```

Then I went to the Optimum documentation, but it does not appear to be up to date, and I did not find a solution there.

Does this mean that Optimum v1.1 cannot be used for T5 inference?

Hi @pierreguillou

Optimum currently does not support ONNX Runtime inference for T5 models (or any other encoder-decoder models).

We are however planning to integrate this feature in the near future.

Also, the error you get does not come from inference but from the `feature` you chose when exporting the model to the ONNX format.

You can try:

`feature = "seq2seq-lm"`
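To make the failure mode concrete, here is a simplified, hypothetical sketch of the feature-to-AutoModel lookup that raises the `KeyError` above (this is not the actual `transformers/onnx/features.py` code; `TASK_TO_AUTOMODEL` and `model_class_for_feature` are illustrative names, with the mapping reconstructed from the classes listed in the traceback):

```python
# Illustrative stand-in for the export-feature lookup: the chosen `feature`
# string is used as a dictionary key, so a name that is valid for pipelines
# ("text2text-generation") is not necessarily a valid ONNX export feature
# ("seq2seq-lm").
TASK_TO_AUTOMODEL = {
    "default": "AutoModel",
    "masked-lm": "AutoModelForMaskedLM",
    "causal-lm": "AutoModelForCausalLM",
    "seq2seq-lm": "AutoModelForSeq2SeqLM",
    "sequence-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "multiple-choice": "AutoModelForMultipleChoice",
    "question-answering": "AutoModelForQuestionAnswering",
    "image-classification": "AutoModelForImageClassification",
}

def model_class_for_feature(feature: str) -> str:
    # "-with-past" variants resolve to the same base task
    task = feature.replace("-with-past", "")
    if task not in TASK_TO_AUTOMODEL:
        raise KeyError(f"Unknown task: {feature}. "
                       f"Possible values are {sorted(TASK_TO_AUTOMODEL)}")
    return TASK_TO_AUTOMODEL[task]

print(model_class_for_feature("seq2seq-lm"))  # AutoModelForSeq2SeqLM

try:
    model_class_for_feature("text2text-generation")
except KeyError as err:
    print("export feature rejected:", err)
```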


Thank you @echarlaix for your answer.

`feature = "seq2seq-lm"` allows me to run the code from my post, but not to use the ONNX model for inference, as you said.

That is, the following code fails:

```python
from optimum.onnxruntime import ORTModel

# Load the quantized model
ort_model = ORTModel("model-quantized.onnx", quantizer._onnx_config)

tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=quantizer.tokenizer))

# The code fails at this line
ort_outputs = ort_model.evaluation_loop(tokenized_ds)
```

And what about Intel Neural Compressor, as cited in Optimizing models towards inference? Does it work for T5 inference?

Another question: where can I find the list of features such as `seq2seq-lm`? Thanks.

Yes exactly, it is not yet possible to use ORTModel to perform ONNX Runtime inference for such models.

Unfortunately, quantization and pruning support for Intel Neural Compressor has not yet been integrated into the text generation examples, but it is something we plan to work on in the coming months.

You can find the features available for exporting models for different types of topologies or tasks here.
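You can also inspect the supported features from code. A minimal sketch, assuming a `transformers` version that ships the `transformers.onnx` module (as in the traceback above); the exact module path and method may vary across versions:

```python
# List the ONNX export features supported for T5, if transformers.onnx
# is available in this environment.
try:
    from transformers.onnx.features import FeaturesManager
    features = sorted(FeaturesManager.get_supported_features_for_model_type("t5"))
except ImportError:
    features = []  # transformers.onnx not available in this environment

print(features)
```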
