How to upload a quantized model?

Hi, I am wondering whether it’s possible to upload a quantized model. From the “model sharing” doc, it looks like we can only upload fine-tuned models based on HF Transformers models.

I learned a bit from I-BERT, which is a Quantization-Aware-Training model. My question is: is it possible to upload an int8 transformer model produced by Post-Training-Quantization rather than Quantization-Aware-Training?

The difference between Post-Training-Quantization and Quantization-Aware-Training is that the former only touches the inference phase (calibrating tensor ranges, then quantizing/dequantizing between fp32 and int8 for a performance speedup), whereas the latter emulates the quantization precision loss by inserting fake_quant ops during training.
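To make the calibrate/quantize/dequantize steps concrete, here is a minimal pure-Python sketch of per-tensor affine int8 quantization. The function names are my own for illustration and are not PyTorch or TensorFlow APIs; real frameworks use more elaborate observers, so treat this as a rough model of the arithmetic only.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive an affine scale and zero point from an observed value range."""
    # The range is widened to include 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value, clamping to the representable range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale


def fake_quant(x, scale, zero_point):
    """Quantize then dequantize: the value stays a float but carries the int8 rounding error."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


activations = [-1.0, 0.0, 0.5, 2.0]
scale, zero_point = calibrate(activations)
int8_values = [quantize(v, scale, zero_point) for v in activations]
```

Post-training quantization only needs `calibrate` plus `quantize`/`dequantize` at inference time, while quantization-aware training would apply something like `fake_quant` inside the training forward pass so the weights adapt to the rounding error.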

Since Post-Training-Quantization only involves the inference phase (a qconfig setting in PyTorch, a graph rewrite in TensorFlow), I don’t know whether it’s possible to upload such a quantized model, and if so, through which API.

Thanks for any guidance.

You can upload any model you want on the hub since it’s git-based. It may not work out of the box with the Transformers library if there is no corresponding class, but you can still share the weights this way.
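As a rough sketch of what sharing arbitrary weights can look like, the snippet below pushes a local folder to a model repo with the `huggingface_hub` client. The function name, folder path, and repo id are placeholders I made up, and it assumes you have already logged in so that a token is available.

```python
def upload_quantized_model(folder_path, repo_id):
    """Push every file in `folder_path` to a model repo on the Hub."""
    # Imported here so the sketch can be read without the dependency installed.
    from huggingface_hub import HfApi

    api = HfApi()  # uses the locally stored access token, if any
    # Create the repo if it does not exist yet; do nothing if it does.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    return api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="model",
        commit_message="Add int8 post-training-quantized weights",
    )
```

A call would look like `upload_quantized_model("./bert-int8", "your-username/bert-int8")`, where both arguments are hypothetical. This only stores the files; whether Transformers can load them afterwards is a separate question.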

In case it’s useful, I’ve also described in another thread some of the main steps needed to reload the quantized weights using PyTorch’s state_dict: Pegasus Model Weights Compression/Pruning - #9 by lewtun

@sgugger @lewtun thanks for the reply.

The problem is that the quantized weights alone are not enough for PyTorch INT8 inference. This is a limitation of PyTorch’s quantization implementation, which only supports quantizing and running inference on the fly: an intermediate Python object, “q_config”, is generated during quantization and used during inference, and PyTorch does not save it. If we want to use the quantized model later or offline, we need to load both the quantized weights and the q_config of each node, which PyTorch does not officially support.

As a result, if we want to upload a quantized model to Hugging Face so that users can download and evaluate it through the Hugging Face API, we have to provide code that reads the saved q_weights and q_config, builds a quantized model object from them, and uses it for evaluation.
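To illustrate the kind of loading code I mean, here is a minimal sketch using PyTorch eager-mode static quantization on a toy model. `TinyClassifier`, `quantize_static`, and `rebuild_quantized` are hypothetical names standing in for a quantized BERT and its loader, and the sketch assumes a PyTorch version that still exposes these functions under `torch.quantization`.

```python
import torch
import torch.nn as nn

# Pick an int8 backend that this build of PyTorch supports.
BACKEND = "fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines else "qnnpack"
torch.backends.quantized.engine = BACKEND


class TinyClassifier(nn.Module):
    """Toy stand-in for a model definition with quant/dequant stubs added."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(8, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)  # fp32 -> int8
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)  # int8 -> fp32


def quantize_static(model, calibration_batches):
    """Post-training static quantization: observe ranges, then convert to int8."""
    model.eval()
    model.qconfig = torch.quantization.get_default_qconfig(BACKEND)
    prepared = torch.quantization.prepare(model)  # inserts observers
    with torch.no_grad():
        for batch in calibration_batches:
            prepared(batch)  # calibration pass
    return torch.quantization.convert(prepared)


def rebuild_quantized(state_dict):
    """Recreate the quantized module structure, then load the saved int8 state."""
    skeleton = TinyClassifier().eval()
    # The same qconfig must be re-applied; it is not stored in the state_dict.
    skeleton.qconfig = torch.quantization.get_default_qconfig(BACKEND)
    skeleton = torch.quantization.prepare(skeleton)
    # Converting without calibration emits a warning and uses placeholder
    # scales, which load_state_dict then overwrites with the saved ones.
    skeleton = torch.quantization.convert(skeleton)
    skeleton.load_state_dict(state_dict)
    return skeleton


torch.manual_seed(0)
calibration = [torch.randn(4, 8) for _ in range(3)]
quantized = quantize_static(TinyClassifier(), calibration)

# In practice this dict would be written with torch.save and read back later;
# it is kept in memory here to keep the sketch short.
saved_state = quantized.state_dict()
restored = rebuild_quantized(saved_state)
```

The point is that `rebuild_quantized` needs three things the weights file alone does not provide: the stubbed model class, the qconfig, and the prepare/convert steps. That is the code a user would need alongside the uploaded weights.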

So this involves some code contributions. I just want to confirm with your expertise whether this is the right direction before we put any resources into it.

Possible code changes include:

  1. Model definition changes (adding quant/dequant stubs for PyTorch imperative models and post-training static quantization), for example introducing a q_bert class in the Hugging Face repo.
  2. The call AutoModelForSequenceClassification.from_pretrained('/path/to/quantized/pytorch/model_a') should accept an additional parameter, “q_config”.
  3. If we want users to be able to use pipeline(), then that function also needs to take an additional q_config parameter.

I’d appreciate any guidance.