Difference in the vectors generated by the int8 quantized model vs the base ONNX model

Recently I tried to compare the BAAI/bge-m3 ONNX model with its int8 AVX2 quantized version. There is a huge difference between the vectors generated by the base ONNX model and those generated by the int8 AVX2 quantized model. The difference remained large even when I tried other quantization instruction sets such as AVX512 and AVX512_VNNI.
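
For reference, this is roughly the kind of comparison I ran (a minimal sketch; the output file names and the CLS-pooling plus L2-normalization step are assumptions about how the bge-m3 dense embedding is produced):

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
base_sess = ort.InferenceSession("bge-m3-base-onnx/model.onnx")
quant_sess = ort.InferenceSession("bge-m3-int8-avx2/model_quantized.onnx")

def embed(session, text):
    inputs = tokenizer(text, return_tensors="np")
    # Feed only the inputs that the exported graph actually declares.
    feed = {k: v for k, v in inputs.items()
            if k in {i.name for i in session.get_inputs()}}
    last_hidden = session.run(None, feed)[0]
    vec = last_hidden[:, 0]  # CLS pooling
    return vec / np.linalg.norm(vec, axis=-1, keepdims=True)

a = embed(base_sess, "Hello world")
b = embed(quant_sess, "Hello world")
print("cosine similarity:", float((a * b).sum()))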

Following are the commands used for the ONNX export and the int8 AVX2 quantization.

Base ONNX model
optimum-cli export onnx --model BAAI/bge-m3 bge-m3-base-onnx

Int8 AVX2 quantized model
optimum-cli onnxruntime quantize --onnx_model bge-m3-base-onnx --avx2 -o bge-m3-int8-avx2
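
For completeness, the same dynamic int8 AVX2 quantization can also be expressed through Optimum's Python API (a sketch assuming the ORTQuantizer / AutoQuantizationConfig interface; paths match the CLI commands above):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic int8 quantization targeting AVX2, mirroring the CLI call above.
quantizer = ORTQuantizer.from_pretrained("bge-m3-base-onnx")
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="bge-m3-int8-avx2", quantization_config=qconfig)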

I would like to know why there is such a huge difference in the vectors generated by the two models, or whether I am converting something incorrectly.

Quantization, or even just casting, can cause slight changes in inference results, which is normal. However, if the results diverge significantly, there may be an issue.

For example, the framework may be configured to skip certain operations, or there may be an unknown bug.
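
One quick check is to count the node types in the quantized graph and see which operators were (or were not) actually converted; dynamically quantized graphs typically contain ops such as MatMulInteger and DynamicQuantizeLinear. A minimal sketch, assuming the default quantized file name:

import onnx
from collections import Counter

# Tally node types in the quantized graph; the presence (or absence) of
# MatMulInteger / DynamicQuantizeLinear nodes shows what was quantized.
model = onnx.load("bge-m3-int8-avx2/model_quantized.onnx")
print(Counter(node.op_type for node in model.graph.node).most_common(15))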

@John6666 thanks for the reply. Is there any way to debug what has gone wrong? I am asking because I don't get any errors or warnings while quantizing the model.

There are ways to output intermediate results for debugging, but I think you can also try to isolate the problem or search for existing issues that may be related. The official ONNX quantization FAQ also seems useful. Have you tried reduce-range, for example?
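
For example, reduce-range can be enabled directly with ONNX Runtime's dynamic quantization API (a minimal sketch; the file paths are placeholders and the keyword arguments assume the current quantize_dynamic signature):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Re-quantize with reduce_range=True, which uses a narrower weight range and
# can reduce int8 saturation on CPUs without VNNI support.
quantize_dynamic(
    "bge-m3-base-onnx/model.onnx",
    "model_quantized_reduce_range.onnx",
    weight_type=QuantType.QInt8,
    reduce_range=True,
)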

Also, regarding ONNX, you can get reliable information by contacting the ONNX Community members on Hugging Face. :grinning_face:

Thanks for the response @John6666
