Dynamic quantization problems

bubblesxin · October 14, 2022, 7:33am

I fine-tuned distilbert-base-multilingual-cased on XNLI.
Environment:

transformers==4.20.1
optimum==1.3.0
evaluate==0.2.2

I used the provided dynamic quantization API to export model-quantized.onnx, then loaded the ONNX file into a pipeline to test its accuracy.

It seems like model-quantized.onnx is exported without weights… If I load model.onnx instead, the accuracy is back to normal. Is there something I missed here? How can I measure the accuracy of the quantized model?
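One quick way to test the "exported without weights" suspicion is to compare the file sizes of the two exports. The sketch below is only a sanity check, not part of the Optimum API: `onnx_file_sizes_mb` is a hypothetical helper, and `./onnx` is the output directory used in the script in this post. A quantized file that is smaller than model.onnx but still far from empty suggests the weights were written.

```python
from pathlib import Path


def onnx_file_sizes_mb(directory):
    """Return {file name: size in MB} for every .onnx file in `directory`."""
    return {
        path.name: path.stat().st_size / 1e6
        for path in sorted(Path(directory).glob("*.onnx"))
    }


if __name__ == "__main__":
    # "./onnx" is the export directory used in the script in this post.
    for name, size_mb in onnx_file_sizes_mb("./onnx").items():
        print(f"{name}: {size_mb:.1f} MB")
```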

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTModelForSequenceClassification
from pathlib import Path
from transformers import AutoTokenizer, pipeline, DistilBertTokenizer, DistilBertForSequenceClassification, EvalPrediction
from tqdm import tqdm
import time
from evaluate import evaluator
from datasets import load_dataset, load_metric
import numpy as np

model_path = "/tmp/en_en"
onnx_path = Path('./onnx/')
onnx_path.mkdir(exist_ok=True)

quantizer = ORTQuantizer.from_pretrained(model_path, feature="sequence-classification")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=qconfig,
)

quantizer.model.config.save_pretrained(onnx_path)  # saves config.json
# Set file_name to "model-quantized.onnx" to evaluate the quantized model instead
model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model.onnx")

eval_dataset = load_dataset("xnli", "en", split="validation")
task_evaluator = evaluator("text-classification")

def preprocess_function(example):
    example["input"] = {"text": example["premise"], "text_pair": example["hypothesis"]}
    return example

eval_dataset = eval_dataset.map(
    preprocess_function,
    batched=False,
    desc="Building premise/hypothesis input pairs",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
onnx_classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

eval_results = task_evaluator.compute(
    model_or_pipeline=onnx_classifier,
    tokenizer=tokenizer,
    metric=load_metric("xnli"),
    input_column="input",
    label_column="label",
    data=eval_dataset,
    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2},
)

print(eval_results)

model.onnx:

accuracy: 0.7815261044176707,

model-quantized.onnx:

accuracy: 0.3333333333333333

The quantized model's accuracy is the same as that of the original distilbert-base-multilingual-cased model on XNLI… Did I miss something? I also tested with optimum==1.4.0 and the issue is still there. The eval result for dynamic quantization is bad.
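An accuracy of exactly 0.3333 on a three-class task is what a model gets when it predicts the same label for every example. The sketch below illustrates this, assuming the XNLI validation split has 2,490 examples balanced across the three labels (830 each); the labels here are synthetic, not loaded from the dataset.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Assumption: 2,490 validation examples, 830 per class (synthetic stand-in labels).
labels = [0] * 830 + [1] * 830 + [2] * 830

# A degenerate classifier that always outputs label 0.
constant_predictions = [0] * len(labels)

print(accuracy(constant_predictions, labels))  # 0.3333333333333333
```

If that assumption holds, the quantized model is probably collapsing to a single class rather than making noisy but informative predictions.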

But when I switched to torch.quantization.quantize_dynamic, it works fine: the accuracy only drops a little.

accuracy: 0.7249

regisss · October 15, 2022, 5:49pm

Hi @bubblesxin! Could you try with per_channel=False in your quantization config please? I think per-channel quantization messes up the classification layer at the end of the model.
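For readers unfamiliar with what per_channel changes, here is a minimal pure-Python sketch of symmetric int8 weight quantization with one scale for the whole tensor versus one scale per output channel (row). The helper names and the toy weight matrix are made up for illustration. This is not ONNX Runtime's implementation and it does not reproduce the failure seen in this thread; it only shows which quantity the flag changes.

```python
def quantize_symmetric(values, scale):
    """Round values to int8 steps of size `scale`, clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def per_tensor_scales(weight):
    # One scale shared by every output channel (row).
    max_abs = max(abs(v) for row in weight for v in row)
    return [max_abs / 127] * len(weight)


def per_channel_scales(weight):
    # One scale per output channel (row).
    return [max(abs(v) for v in row) / 127 for row in weight]


def mean_abs_error(weight, scales):
    """Average absolute difference between weights and their dequantized values."""
    total, count = 0.0, 0
    for row, scale in zip(weight, scales):
        for v, q in zip(row, quantize_symmetric(row, scale)):
            total += abs(v - q * scale)
            count += 1
    return total / count


# Toy weights (illustrative only): one large-magnitude and one small-magnitude channel.
weight = [
    [0.90, -0.75, 0.40],
    [0.010, -0.008, 0.004],
]

print("per-tensor error: ", mean_abs_error(weight, per_tensor_scales(weight)))
print("per-channel error:", mean_abs_error(weight, per_channel_scales(weight)))
```

On this toy matrix the per-channel scales give a smaller reconstruction error, which is the usual motivation for the option. The thread shows that in practice it can still hurt end-to-end accuracy with some configurations.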

bubblesxin · October 16, 2022, 4:53am

Thank you @regisss, the problem does indeed lie in per_channel. But the accuracy is still bad compared to the original model: it's 72.30 using the code above. In most cases we should expect an accuracy drop of only 1% or 2%, right? Is that because the model is too small?

regisss · October 16, 2022, 8:00am

We should manage to get a smaller drop in accuracy. I quickly trained distilbert-base-multilingual-cased on XNLI en (you can see the results here) and with the following quantization config I got better results:

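from onnxruntime.quantization import QuantFormat, QuantizationMode
from optimum.onnxruntime.configuration import QuantizationConfig

qconfig = QuantizationConfig(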
    is_static=False,
    format=QuantFormat.QOperator,  # Same as AutoQuantizationConfig.avx512_vnni for dynamic quantization
    mode=QuantizationMode.IntegerOps,  # Same as AutoQuantizationConfig.avx512_vnni for dynamic quantization
    per_channel=True,
    reduce_range=True,
    operators_to_quantize=["MatMul", "Add"],  # Same as AutoQuantizationConfig.avx512_vnni for dynamic quantization
)

The ONNX Runtime docs say (here) that in some cases reduce_range should be set to True when using per-channel quantization, which is why I tried it, and it worked well. I guess this mitigates the impact of per-channel quantization on the classification layer while still getting better quantization on other layers.
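As I understand the ONNX Runtime quantization docs, reduce_range quantizes weights to 7 bits instead of 8 so that the int16 intermediate sums in the uint8 x int8 kernels used on some x86-64 CPUs cannot saturate. The sketch below is a simplified pure-Python model of that arithmetic: `saturating_pair_sum` is a hypothetical helper, not a real kernel. Whether saturation is what actually hurt accuracy in this thread is not confirmed.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturating_pair_sum(a0, w0, a1, w1):
    """Add two (uint8 activation * int8 weight) products with int16 saturation.

    Returns (saturated result, exact result) so the two can be compared.
    """
    exact = a0 * w0 + a1 * w1
    return max(INT16_MIN, min(INT16_MAX, exact)), exact


# 8-bit weights: the worst case overflows int16 and gets clipped.
print(saturating_pair_sum(255, 127, 255, 127))  # (32767, 64770)

# 7-bit weights (range [-64, 63]): the worst cases still fit in int16.
print(saturating_pair_sum(255, 63, 255, 63))    # (32130, 32130)
print(saturating_pair_sum(255, -64, 255, -64))  # (-32640, -32640)
```

The trade-off is one bit less of weight resolution in exchange for exact accumulation.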

Could you please try this configuration and let me know how it goes @bubblesxin?

bubblesxin · October 16, 2022, 9:36am

Thank you for your reply! I got 77.5 accuracy using your qconfig code. Thank you so much.

I also tested on XNLI ES: the quantized model is even better than the original one on eval.
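To put the XNLI EN numbers reported in this thread side by side, the short sketch below computes each configuration's accuracy drop in percentage points relative to the unquantized model.onnx. All figures are copied from the posts above; the baseline is rounded to two decimals.

```python
# Accuracy (%) of the unquantized model.onnx, rounded from 0.7815261044176707.
baseline = 78.15

# Accuracies (%) reported in this thread for each quantization setup.
results = {
    "ONNX dynamic, per_channel=True": 33.33,
    "ONNX dynamic, per_channel=False": 72.30,
    "ONNX dynamic, per_channel=True, reduce_range=True": 77.5,
    "torch.quantization.quantize_dynamic": 72.49,
}

drops = {name: round(baseline - acc, 2) for name, acc in results.items()}

for name, drop in drops.items():
    print(f"{name}: {drop:.2f} points below baseline")
```

Only the reduce_range=True configuration lands within the 1 to 2 point drop that was expected earlier in the thread.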
