Dynamic quantization problems

Finetuned distilbert-base-multilingual-cased on XNLI


I used the provided dynamic quantization API to export model-quantized.onnx, then loaded the ONNX file into a pipeline to test its accuracy.

It seems like model-quantized.onnx is exported without weights… If I load model.onnx instead, the accuracy goes back to normal. Is there something I missed in this part? How can I measure the accuracy of the quantized model?

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTModelForSequenceClassification
from pathlib import Path
from transformers import AutoTokenizer, pipeline, DistilBertTokenizer, DistilBertForSequenceClassification, EvalPrediction
from tqdm import tqdm
import time
from evaluate import evaluator
from datasets import load_dataset, load_metric
import numpy as np

model_path = "/tmp/en_en"
onnx_path = Path('./onnx/')

quantizer = ORTQuantizer.from_pretrained(model_path, feature="sequence-classification")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
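# Export the ONNX model and its dynamically quantized counterpart
quantizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=qconfig,
)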

quantizer.model.config.save_pretrained(onnx_path) # saves config.json
# Pass file_name="model-quantized.onnx" instead to evaluate the quantized model
model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model.onnx")

eval_dataset = load_dataset("xnli", "en", split="validation")
task_evaluator = evaluator("text-classification")

def preprocess_function(example):
    example["input"] = {"text": example["premise"], "text_pair": example["hypothesis"]}
    return example

eval_dataset = eval_dataset.map(
    preprocess_function,
    desc="Running tokenizer on train dataset",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
onnx_classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

eval_results = task_evaluator.compute(
    model_or_pipeline=onnx_classifier,
    data=eval_dataset,
    metric="accuracy",
    input_column="input",
    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2},
)



With model.onnx:
accuracy: 0.7815261044176707

With model-quantized.onnx:
accuracy: 0.3333333333333333

The accuracy is the same as the accuracy of the original distilbert-base-multilingual-cased model on XNLI… Did I miss something? I also tested with optimum==1.4.0 and the issue is still there. The eval result for dynamic quantization is bad.

But when I switched to torch.quantization.quantize_dynamic, it works fine: the accuracy only drops a little.

accuracy: 0.7249
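A note on the 0.3333 figure reported above: on a three-label task, an accuracy of exactly one third is what you get when a model predicts the same class for every example of a balanced evaluation set. The sketch below only illustrates that arithmetic. The label lists are made up, and the balanced split is an assumption, not something checked against the actual data in this thread.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Hypothetical balanced three-class evaluation set (100 examples per label).
references = [0, 1, 2] * 100

# A degenerate model that always answers label 0, whatever the input.
constant_predictions = [0] * len(references)

constant_acc = accuracy(constant_predictions, references)
perfect_acc = accuracy(references, references)
print(f"constant predictor: {constant_acc:.4f}")
print(f"perfect predictor:  {perfect_acc:.4f}")
```

Seeing this value suggests the quantized model has collapsed to a single output class rather than merely lost some precision, which fits the impression that its weights are unusable.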

Hi @bubblesxin! Could you try with per_channel=False in your quantization config please? I think per-channel quantization messes up the classification layer at the end of the model.
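For readers unfamiliar with the flag, here is a minimal sketch of what per-channel quantization changes: one scale per output channel instead of one scale shared by the whole weight tensor. It is a toy symmetric int8 scheme in pure Python with made-up weights, not ONNX Runtime's implementation, and it does not reproduce the failure on the classification layer, which remains a hypothesis in this thread.

```python
# Toy illustration only: symmetric int8 weight quantization,
# per tensor vs per channel. All weights below are hypothetical.

def quantize_symmetric(values, scale, qmax=127):
    """Round values / scale to integers and clamp them to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def per_tensor_scale(rows, qmax=127):
    """One scale shared by every row of the weight matrix."""
    return max(abs(v) for row in rows for v in row) / qmax

def per_channel_scales(rows, qmax=127):
    """One scale per row (a row stands for one output channel here)."""
    return [max(abs(v) for v in row) / qmax for row in rows]

def mean_abs_error(rows, scales):
    """Mean absolute error after quantizing and dequantizing each row."""
    total, count = 0.0, 0
    for row, scale in zip(rows, scales):
        for v, q in zip(row, quantize_symmetric(row, scale)):
            total += abs(v - q * scale)
            count += 1
    return total / count

# The first channel has a much larger range than the second.
weights = [
    [4.0, -2.5, 1.0, 3.2],
    [0.02, -0.01, 0.03, -0.025],
]

shared = per_tensor_scale(weights)
err_tensor = mean_abs_error(weights, [shared] * len(weights))
err_channel = mean_abs_error(weights, per_channel_scales(weights))
print(f"per-tensor error:  {err_tensor:.6f}")
print(f"per-channel error: {err_channel:.6f}")
```

In this toy case per-channel scales give the smaller reconstruction error, because the small-range channel no longer inherits the scale of the large one. That is the usual motivation for the option, so a large accuracy loss with per_channel=True points to an interaction with the runtime or a specific layer rather than to the idea itself.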

Thank you @regisss, indeed, the problem lies in per_channel. But the accuracy is still bad compared to the original model: it's 72.30 using the code above. In most cases we should expect an accuracy drop of 1% or 2%, right? Is that because the model is too small?

We should manage to get a smaller drop in accuracy. I quickly trained distilbert-base-multilingual-cased on XNLI en (you can see the results here) and with the following quantization config I got better results:

qconfig = QuantizationConfig(
    is_static=False,  # dynamic quantization
    format=QuantFormat.QOperator,  # Same as AutoQuantizationConfig.avx512_vnni for dynamic quantization
    mode=QuantizationMode.IntegerOps,  # Same as AutoQuantizationConfig.avx512_vnni for dynamic quantization
    per_channel=True,
    reduce_range=True,  # set together with per_channel, see the ONNX Runtime note below
    operators_to_quantize=["MatMul", "Add"],  # Same as AutoQuantizationConfig.avx512_vnni for dynamic quantization
)

The ONNX Runtime documentation says here that in some cases reduce_range should be set to True when using per-channel quantization, which is why I tried it, and it worked well. I guess this mitigates the impact of per-channel quantization on the classification layer while still getting better quantization on the other layers.

Could you please try this configuration and let me know how it goes @bubblesxin?
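On the reduce_range point above: one commonly described reason the option helps is that some x86 kernels multiply unsigned 8-bit activations by signed 8-bit weights and add adjacent pairs into a signed 16-bit value with saturation, so full-range 8-bit weights can overflow that intermediate. The sketch below models only that arithmetic. The 7-bit range of [-64, 63] is an assumption for illustration, and nothing here establishes that saturation is what actually happened in this thread.

```python
# Toy model of a saturating 16-bit multiply-add over two (u8 activation,
# s8 weight) pairs. Not ONNX Runtime code; ranges are illustrative assumptions.
INT16_MIN, INT16_MAX = -32768, 32767

def saturating_pair_sum(a0, w0, a1, w1):
    """Return (exact, stored) for a0*w0 + a1*w1 kept in a signed 16-bit value."""
    exact = a0 * w0 + a1 * w1
    stored = max(INT16_MIN, min(INT16_MAX, exact))
    return exact, stored

def weight_range(reduce_range):
    """Assumed signed weight range: 7 bits with reduce_range, else 8 bits."""
    return (-64, 63) if reduce_range else (-128, 127)

results = {}
for reduce_range in (False, True):
    _, w_max = weight_range(reduce_range)
    # Worst case: both activations and both weights at their maximum.
    exact, stored = saturating_pair_sum(255, w_max, 255, w_max)
    results[reduce_range] = (exact, stored)
    print(f"reduce_range={reduce_range}: exact={exact}, stored={stored}, "
          f"saturated={exact != stored}")
```

With full 8-bit weights the worst-case pair sum is 64770 and gets clamped to 32767, while with the assumed 7-bit range it is 32130 and fits. The cost is one bit of weight precision, which is why the option is a trade-off rather than a default.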

Thank you for your reply! I got 77.5 accuracy using your qconfig code. Thank you so much.

I also tested on XNLI ES, where the quantized model is even better than the original one on eval.
