Quantization of facebook/opt-13b model

Hi,
I have a locally finetuned facebook/opt-13b model. I want to quantize it to shrink the model size and get faster inference. I converted it to ONNX (after spending a lot of time on that step). I found one PR here. Its code works for all OPT variants except the opt-13b that I need, so after some discussion here I made a few changes and finally managed to quantize the model. But the quantized model's output is not good at all. I asked here a few days ago but didn't get any answer.

The code:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic (weight-only) quantization of the exported ONNX model.
quantize_dynamic(
    "onnx2/model.onnx",                 # input: FP32 ONNX export
    "onnx-quantized2/model-int8.onnx",  # output: INT8 model
    weight_type=QuantType.QUInt8,       # unsigned 8-bit weights
    use_external_data_format=True,      # model is larger than the 2 GB protobuf limit
)

The quantized model output:

This is an award winning short story titled The Drive. This story titled The A
 A story  A A  A A A
 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A

It should be something like this:

This is an award winning short story titled The Drive. This story is written with descriptive language, described in detail. This is the first chapter of The Drive.\n\nIn this chapter I am a professional driver. I am driving a car from San Francisco to Los Angeles. I have a female passenger who is a famous photographer. She is taking a photo of me as I drive.\n\n#action, #driving, #fiction, #funny, #funny, #driving,

What do you think is the problem?