Hi,
I have a locally finetuned version of the facebook/opt-13b model. I want to quantize it to shrink the model size and get faster inference. After spending a lot of time on it, I managed to convert the model to ONNX. I found one PR here; the code there works for all OPT variants except the opt-13b that I need. So after some discussion here, I made a few changes and finally got the model quantized. But the quantized model's output is not good at all. I asked here a few days ago but didn't get any answer.
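For context, the export step was along these lines (I'm reconstructing it as a sketch, so take the exact calls as approximate; the checkpoint path is a placeholder for my local finetuned model):

from optimum.onnxruntime import ORTModelForCausalLM

# Export the finetuned PyTorch checkpoint to ONNX via optimum.
# "path/to/finetuned-opt-13b" is a placeholder for my local checkpoint.
model = ORTModelForCausalLM.from_pretrained(
    "path/to/finetuned-opt-13b",
    export=True,  # run the PyTorch -> ONNX conversion on load
)
# Writes model.onnx plus external data files (the model is > 2 GB)
model.save_pretrained("onnx2")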
The quantization code:
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic (weight-only) quantization of the exported ONNX model.
quantize_dynamic(
    "onnx2/model.onnx",                 # input: the FP32 ONNX export
    "onnx-quantized2/model-int8.onnx",  # output: the INT8 model
    weight_type=QuantType.QUInt8,       # quantize weights to unsigned 8-bit
    use_external_data_format=True,      # model > 2 GB, so keep external weight files
)
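And this is roughly how I generate text from the quantized model (the decoding settings here are a simplification of what I actually ran):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")
# Load the quantized model, pointing at the INT8 file explicitly.
model = ORTModelForCausalLM.from_pretrained(
    "onnx-quantized2",
    file_name="model-int8.onnx",
)

prompt = "This is an award winning short story titled The Drive."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))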
The quantized model's output:
This is an award winning short story titled The Drive. This story titled The A
A story A A A A A
A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A [... "A" repeated for the rest of the generation]
It should be something like this:
This is an award winning short story titled The Drive. This story is written with descriptive language, described in detail. This is the first chapter of The Drive.\n\nIn this chapter I am a professional driver. I am driving a car from San Francisco to Los Angeles. I have a female passenger who is a famous photographer. She is taking a photo of me as I drive.\n\n#action, #driving, #fiction, #funny, #funny, #driving,'
What do you think is the problem?