Optimum & RoBERTa: how far can we trust a quantized model compared with its PyTorch version?


I adapted this code from the Optimum GitHub repository, which targets the sequence-classification model distilbert-base-uncased-finetuned-sst-2-english, to the masked-LM model RoBERTa base.

It works (see the code below), but the predictions differ (see the results at the end of this post). I understand why (quantization changes the weight values and therefore the forward computation of a prediction), but how much does it change the model at inference time?

My question is: how far can we trust a quantized model compared with its PyTorch version?
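To make the weight-perturbation point concrete, here is a minimal NumPy sketch of an int8 round trip on a single weight matrix. It is an illustration under stated assumptions, not what ONNX Runtime does internally: the symmetric per-tensor scheme, the helper name quantize_dequantize_int8, and the random weight matrix and activation are all made up for the example.

```python
import numpy as np

def quantize_dequantize_int8(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 round trip: float32 -> int8 -> float32."""
    scale = np.abs(w).max() / 127.0  # a single scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

# Made-up data standing in for one linear layer (hypothetical shapes and values).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(768, 768)).astype(np.float32)
x = rng.normal(0.0, 1.0, size=(768,)).astype(np.float32)

w_hat = quantize_dequantize_int8(w)
scale = float(np.abs(w).max() / 127.0)

# Each weight moves by at most half a quantization step.
max_weight_error = float(np.abs(w - w_hat).max())

# Effect on the output of a single matrix-vector product.
rel_output_error = float(
    np.linalg.norm(w @ x - w_hat @ x) / np.linalg.norm(w @ x)
)

print(f"max weight error: {max_weight_error:.2e} (bound: {scale / 2:.2e})")
print(f"relative output error of one matmul: {rel_output_error:.2e}")
```

The error of one layer is bounded, but a transformer stacks many such layers, so the effect on the final logits has to be measured rather than derived from this bound.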

My code:

Quantization of the model

!pip install datasets
!python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime]

from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer

# The model we wish to quantize
#model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model_checkpoint = "roberta-base"

# The type of quantization to apply
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)

#quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="sequence-classification")
quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="masked-lm")

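# Quantize the model!
# The export call is missing from the post; this is assumed from the Optimum README
# of the time. "model.onnx" is an assumed path for the intermediate ONNX export.
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_path="model-quantized.onnx",
    quantization_config=qconfig,
)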

Inference (get logits)

from functools import partial
from datasets import Dataset
from optimum.onnxruntime.model import ORTModel

# Load quantized model
ort_model = ORTModel("model-quantized.onnx", quantizer._onnx_config)

# Create a dataset or load one from the Hub
#ds = Dataset.from_dict({"sentence": ["I love burritos!"]})
sentence = "The goal of life is <mask>."
ds = Dataset.from_dict({"sentence": [sentence]})

# Tokenize the inputs
def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["sentence"])

tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=quantizer.tokenizer))
ort_outputs = ort_model.evaluation_loop(tokenized_ds)

# Extract logits!
logits = ort_outputs.predictions


import numpy as np

tokens = np.array(tokenized_ds['input_ids'])
mask_token_index = (tokens == quantizer.tokenizer.mask_token_id)[0].nonzero()[0][0]
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)

# replace mask token by predicted token
tokens[0][mask_token_index] = predicted_token_id

print(quantizer.tokenizer.decode(tokens[0], skip_special_tokens=True))


The previous code will print:

The goal of life is <mask>.
The goal of life is life.

However, the PyTorch version of RoBERTa base, whether on the Hub or in Spaces, gives a different result.



I compared the inference time of the PyTorch and Optimum RoBERTa base models in this Colab notebook.

Here are the results:

  • PyTorch: 247 ms
  • Optimum: 719 ms

Something is wrong, no?
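Single timings like these depend heavily on what is measured (warm-up, dataset mapping, evaluation-loop overhead). A minimal sketch of a fairer measurement, assuming each model can be wrapped in a zero-argument callable; the helper benchmark and the stand-in workload dummy_forward are hypothetical names used only for illustration:

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=20):
    """Return the median latency of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms)

# Stand-in workload; replace it with one forward pass of the model under test.
def dummy_forward():
    return sum(i * i for i in range(10_000))

latency_ms = benchmark(dummy_forward)
print(f"median latency: {latency_ms:.3f} ms")
```

Timing only the forward pass of each model, on the same tokenized input, keeps preprocessing and evaluation-loop overhead out of the comparison.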

Hi @pierreguillou,

ORTModel was added mainly so that users could obtain evaluation results and compare their original model with the resulting quantized / optimized model, and it was meant to be temporary. This class is now deprecated, and we encourage users to use our new ORTModelForXxx classes, which will give you better performance.

The best way to evaluate your quantized model is to run the same evaluation as for its full-precision counterpart and compare the two. I would say there is no way to know before doing so, as different types of quantization (dynamic vs. static, for example) impact the model differently, and some architectures and tasks can be more sensitive to quantization than others.
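As a rough illustration of such a comparison at the logits level, here is a small NumPy sketch. It assumes both models have already produced logits of the same shape for the same inputs; the function compare_logits and the synthetic arrays are hypothetical and only stand in for real model outputs. It does not replace a task-level evaluation on a labelled dataset.

```python
import numpy as np

def compare_logits(ref_logits, quant_logits, k=5):
    """Compare two logit arrays of identical shape (..., vocab_size)."""
    ref = np.asarray(ref_logits, dtype=np.float32)
    quant = np.asarray(quant_logits, dtype=np.float32)
    assert ref.shape == quant.shape, "both models must score the same inputs"

    # Fraction of positions where both models pick the same top token.
    top1_agreement = float((ref.argmax(-1) == quant.argmax(-1)).mean())

    # Average overlap between the two top-k candidate sets.
    ref_topk = np.argsort(-ref, axis=-1)[..., :k].reshape(-1, k)
    quant_topk = np.argsort(-quant, axis=-1)[..., :k].reshape(-1, k)
    overlaps = [len(set(r) & set(q)) / k for r, q in zip(ref_topk, quant_topk)]

    return {
        "top1_agreement": top1_agreement,
        "topk_overlap": float(np.mean(overlaps)),
        "max_abs_diff": float(np.abs(ref - quant).max()),
    }

# Synthetic stand-ins for (batch, sequence, vocab) logits from the two models.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 8, 100)).astype(np.float32)
quant = ref + rng.normal(scale=0.01, size=ref.shape).astype(np.float32)

report = compare_logits(ref, quant)
print(report)
```

For a masked-LM model, restricting the comparison to the masked positions is the most relevant variant, since those are the only predictions that are used.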

Hi @echarlaix,

Thanks for the link to the new ORTModelForXxx classes, but the second code example in the documentation does not work in Colab (see screenshot). It is the example that accompanies this passage:

Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones. Simply pass from_transformers=True to the from_pretrained() method, and your model will be loaded and converted to ONNX on-the-fly.

Also, the notebook link is missing in the following sentence (it comes just before the heading “Working with the Hugging Face Model Hub”):

You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this [notebook](todo:add-link).

It would be great if you could help with these issues.

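The following code works on my side (adapted from the documentation). Can you try it?

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the Transformers checkpoint and convert it to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
text = "Hello, my dog is cute"
# Pass the text positionally (no text= keyword)
pred = onnx_classifier(text)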

Yes, your code works, @echarlaix, because you did not pass text= when calling the onnx_classifier pipeline object, as the Optimum page does (see "Switching from Transformers to Optimum Inference" and the screenshot below).

Good catch, I will correct the documentation, thanks!

Thank you @echarlaix.

Another question: I do not see the class ORTModelForMaskedLM in the documentation (only ORTModelForCausalLM).

Do you plan to add (at least) this class to the Optimum library?

Hi @pierreguillou

Yes, we plan to add many more ORTModelForXxx classes in the future, and we are currently working on ORTModelForSeq2SeqLM (#199). Everyone is welcome to contribute, so don’t hesitate to open a PR if you find time to work on the integration of ORTModelForMaskedLM.
