Order between optimization and quantization

We can perform model quantization and optimization either separately or together. When we want to apply both, is there an order that should be followed?

Is the order task dependent or dependent on the optimization being applied?

TL;DR: The order is optimization first, then quantization.

  • When using the CLI to export a generative model (tasks such as text-generation, text2text-generation, automatic-speech-recognition, etc.), specify the optimizations during the export. That way they are applied before post-processing, and specifically before any model merging, which changes the graph.
  • Quantization, on the other hand, should be applied after optimization and merging (both performed by the export CLI command). The reason is that quantization introduces new quantized operators (nodes in the graph) that might interfere with the other two steps.
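
To make the order concrete, here is a minimal sketch that builds the two CLI invocations in sequence: export with optimization first, then quantization of the exported directory. It assumes the ONNX Runtime backend of `optimum-cli`; the model id, directory names, optimization level `O2`, and the `--avx512` quantization target are placeholders I picked for illustration, and the exact flags should be checked against `optimum-cli export onnx --help` and `optimum-cli onnxruntime quantize --help` for your installed version.

```python
import shlex
import subprocess


def build_commands(model_id, export_dir, quantized_dir, opt_level="O2"):
    """Return the two commands in the order they should run.

    All argument values are illustrative placeholders, not recommendations.
    """
    # Step 1: export + optimize in one command, so the optimizations are
    # applied before post-processing (e.g. decoder merging).
    export_cmd = [
        "optimum-cli", "export", "onnx",
        "--model", model_id,
        "--optimize", opt_level,
        export_dir,
    ]
    # Step 2: quantize the already optimized (and merged) export.
    # --avx512 is an assumed target; pick the one matching your hardware.
    quantize_cmd = [
        "optimum-cli", "onnxruntime", "quantize",
        "--onnx_model", export_dir,
        "--avx512",
        "-o", quantized_dir,
    ]
    return [export_cmd, quantize_cmd]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run(build_commands("gpt2", "gpt2_onnx", "gpt2_onnx_quantized"))` only prints the two commands; pass `dry_run=False` to actually execute them, which requires `optimum` with the ONNX Runtime extras installed.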