Optimum & RoBERTa: how far can we trust a quantized model against its pytorch version?

pierreguillou · May 27, 2022, 8:00pm

Hi,

I adapted this code from the Optimum GitHub repository, originally written for the sequence-classification model distilbert-base-uncased-finetuned-sst-2-english, to the masked-lm model RoBERTa base.

It works (see the code below), but the predictions are different (see the results at the end of this post). I understand the reason (quantization changes the weight values and therefore the forward computation of a prediction), but how much does it change the model at inference time?
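As a rough illustration of that reasoning, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization applied to a toy projection matrix. It is not what ONNX Runtime does internally; the shapes, the random data, and the helper name quantize_dequantize_int8 are all assumptions made for the example. It only shows that small rounding errors in the weights propagate to the logits, which can be enough to change the argmax when two candidates are close.

```python
import numpy as np

def quantize_dequantize_int8(w):
    """Symmetric per-tensor int8 quantization followed by dequantization."""
    scale = np.abs(w).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale, scale

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1, 64)).astype(np.float32)     # toy hidden state
weight = rng.normal(size=(64, 1000)).astype(np.float32)  # toy output projection

weight_q, scale = quantize_dequantize_int8(weight)

logits_fp32 = hidden @ weight      # full-precision logits
logits_int8 = hidden @ weight_q    # logits computed with rounded weights

max_weight_error = np.abs(weight - weight_q).max()
max_logit_error = np.abs(logits_fp32 - logits_int8).max()
same_top1 = logits_fp32.argmax() == logits_int8.argmax()

print(f"max weight error: {max_weight_error:.5f} (bound scale/2 = {scale / 2:.5f})")
print(f"max logit error:  {max_logit_error:.5f}")
print(f"same top-1 token: {same_top1}")
```

Each weight moves by at most half a quantization step, but the logit error is a sum over many such perturbations, so it is typically larger than any single weight error.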

My question is: how far can we trust a quantized model against its pytorch version?

My code:

Quantization of the model

!pip install datasets
!python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime]

from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer

# The model we wish to quantize
#model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model_checkpoint = "roberta-base"

# The type of quantization to apply
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)

#quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="sequence-classification")
quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="masked-lm")

# Quantize the model!
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)

Inference (get logits)

from functools import partial
from datasets import Dataset
from optimum.onnxruntime.model import ORTModel

# Load quantized model
ort_model = ORTModel("model-quantized.onnx", quantizer._onnx_config)

# Create a dataset or load one from the Hub
#ds = Dataset.from_dict({"sentence": ["I love burritos!"]})
sentence = "The goal of life is <mask>."
ds = Dataset.from_dict({"sentence": [sentence]})

# Tokenize the inputs
def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["sentence"])

tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=quantizer.tokenizer))
ort_outputs = ort_model.evaluation_loop(tokenized_ds)

# Extract logits!
logits = ort_outputs.predictions

Prediction

import numpy as np

tokens = np.array(tokenized_ds['input_ids'])
mask_token_index = (tokens == quantizer.tokenizer.mask_token_id)[0].nonzero()[0][0]
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)

# replace mask token by predicted token
tokens[0][mask_token_index] = predicted_token_id

print(sentence)
print(quantizer.tokenizer.decode(tokens[0], skip_special_tokens=True))
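Looking only at the argmax hides how close the runner-up tokens are. A small sketch, assuming the logits at the mask position are available as a 1-D NumPy array, that lists the top-k candidates with their softmax probabilities. The helper name top_k_tokens and the toy logits are hypothetical.

```python
import numpy as np

def top_k_tokens(mask_logits, k=5):
    """Return (token_id, probability) pairs for the k highest-scoring tokens."""
    shifted = mask_logits - mask_logits.max()      # stabilise the softmax
    probs = np.exp(shifted) / np.exp(shifted).sum()
    top_ids = np.argsort(probs)[::-1][:k]          # highest probability first
    return [(int(i), float(probs[i])) for i in top_ids]

# Hypothetical logits over a 6-token vocabulary at the <mask> position
toy_logits = np.array([2.0, 1.9, 0.3, -1.0, 0.0, 1.0])
for token_id, prob in top_k_tokens(toy_logits, k=3):
    print(token_id, round(prob, 3))
```

With the variables from the code above, this could be called as top_k_tokens(logits[0, mask_token_index]) and each id decoded with quantizer.tokenizer.decode. If the PyTorch model's top prediction appears in second or third place for the quantized model, the two models are closer than the single decoded sentence suggests.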

Results

The previous code will print:

The goal of life is <mask>.
The goal of life is life.

However, the PyTorch version of RoBERTa base, either on the Hub or in Spaces, gives as its result:

pierreguillou · May 31, 2022, 6:20pm

Hi.

I compared the inference time of the PyTorch and Optimum RoBERTa base models in this Colab notebook.

Here are the results:

Pytorch: 247 ms
Optimum: 719.0 ms

Something wrong, no?
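The thread does not say how these timings were taken. For a like-for-like comparison, one common approach is to discard a few warm-up calls and report the median over several runs. A minimal sketch, where benchmark and dummy_inference are hypothetical names and the dummy workload stands in for the real inference call:

```python
import time
import statistics

def benchmark(fn, warmup=3, runs=20):
    """Return the median latency of fn() in milliseconds after warm-up calls."""
    for _ in range(warmup):          # exclude one-off costs (session init, caches)
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Hypothetical stand-in workload; replace with the real inference call
def dummy_inference():
    return sum(i * i for i in range(10_000))

print(f"median latency: {benchmark(dummy_inference):.3f} ms")
```

Timing both models through the same kind of call (a single forward pass on the same tokenized input) avoids comparing a bare forward pass against a loop that also does dataset handling.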

echarlaix · June 2, 2022, 12:54pm

Hi @pierreguillou,

ORTModel was a class added primarily so that users could obtain evaluation results and compare their original model with the resulting quantized / optimized model, and it was meant to be temporary. This class is now deprecated and we encourage users to use our new ORTModelForXxx classes, which will give you better performance.

The best way to evaluate your resulting quantized model is to run the same evaluation as for its full-precision counterpart and compare both. I would say there is no way to know until then, as different types of quantization (dynamic vs. static, for example) impact the model differently, and some architectures and tasks can be more sensitive to quantization than others.
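A minimal sketch of such a side-by-side comparison, assuming both models' logits for the same inputs have already been collected as NumPy arrays. The function name compare_logits and the toy arrays are hypothetical, and in practice a task metric (accuracy, recall, perplexity) computed on a real evaluation set is the more meaningful number.

```python
import numpy as np

def compare_logits(reference, quantized):
    """Summarise how closely quantized logits track the full-precision ones."""
    reference = np.asarray(reference, dtype=np.float64)
    quantized = np.asarray(quantized, dtype=np.float64)
    ref_top1 = reference.argmax(axis=-1)
    quant_top1 = quantized.argmax(axis=-1)
    return {
        # fraction of examples where both models pick the same class/token
        "top1_agreement": float((ref_top1 == quant_top1).mean()),
        # largest element-wise deviation between the two sets of logits
        "max_abs_diff": float(np.abs(reference - quantized).max()),
    }

# Hypothetical logits for 3 examples over 4 classes
ref = np.array([[2.0, 0.1, 0.0, -1.0],
                [0.2, 1.5, 0.3, 0.0],
                [0.0, 0.9, 1.0, 0.1]])
quant = np.array([[1.8, 0.2, 0.1, -0.9],
                  [0.3, 1.4, 0.2, 0.1],
                  [0.1, 1.1, 1.0, 0.0]])

print(compare_logits(ref, quant))
```

In the toy data the third example flips its top-1 prediction even though no logit moves by more than about 0.2, which is the kind of near-tie where quantization is most likely to change an output.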

pierreguillou · June 2, 2022, 2:50pm

Hi @echarlaix,

Thanks for the link to the new ORTModelForXxx classes, but the second code example does not work in Colab (see screenshot). It is the code that accompanies this passage:

Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones. Simply pass from_transformers=True to the from_pretrained() method, and your model will be loaded and converted to ONNX on-the-fly.

And there is no link to a notebook in the following sentence (the sentence comes just before the title “Working with the Hugging Face Model Hub”):

You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this [notebook](todo:add-link).

It would be great if you could help with these issues.

echarlaix · June 2, 2022, 3:31pm

The following code works on my side (adapted from the documentation). Can you try it?

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the Transformers checkpoint and convert it to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
text = "Hello, my dog is cute"
pred = onnx_classifier(text)

pierreguillou · June 2, 2022, 9:55pm

Yes, your code works @echarlaix because you did not write text= when calling the pipeline object onnx_classifier, as is done on the Optimum page (check Switching from Transformers to Optimum Inference and the screenshot below).

echarlaix · June 3, 2022, 1:01pm

Good catch, I will correct the documentation, thanks !

pierreguillou · June 3, 2022, 4:25pm

Thank you @echarlaix.

Another question: I do not see the class ORTModelForMaskedLM in the docs (only ORTModelForCausalLM).

Do you plan to update the Optimum library with (at least) this class?

echarlaix · June 6, 2022, 8:39am

Hi @pierreguillou

Yes, we plan to add many more ORTModelForXxx classes in the future, and are currently working on ORTModelForSeq2SeqLM (#199). Everyone is welcome to contribute, so don’t hesitate to open a PR if you find time to work on the integration of ORTModelForMaskedLM.

nickmuchi · July 22, 2022, 2:37am

Hi there, I tried the above code snippet and installed the necessary packages, but got the symlink error below:

I tried turning on developer mode on my Windows PC and restarting, but I still get the same error. It only works in Colab for me.

echarlaix · July 27, 2022, 8:16am

Let’s move this discussion to Symlink error when importing ORTSeqClass model via Pipeline as you are describing the same problem there.
