How to upload a quantized model?

@sgugger @lewtun thanks for the reply.

The problem is that quantized weights alone are not enough for PyTorch INT8 inference. This is a limitation of the PyTorch quantization implementation, which only supports on-the-fly quantization followed by on-the-fly inference: an intermediate Python object ("q_config") is generated during quantization and used during inference, but PyTorch does not save this q_config object. If we want to use the quantized model later or offline, we need to load both the quantized weights and the q_config of each node, and that is not supported by PyTorch officially.
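To make the limitation concrete, here is a minimal sketch of the eager-mode post-training static quantization flow on a toy module (the wrapper class and file name are just placeholders, not anything from transformers): the INT8 model is usable immediately after convert(), but only its weights can be saved, so a later process has to replay the same qconfig/prepare/convert steps before it can load them.

```python
import torch
from torch.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class QuantizableWrapper(torch.nn.Module):
    """Toy wrapper adding quant/dequant stubs around an fp32 module."""
    def __init__(self, fp32_model):
        super().__init__()
        self.quant = QuantStub()      # fp32 -> int8 at the model input
        self.model = fp32_model
        self.dequant = DeQuantStub()  # int8 -> fp32 at the model output

    def forward(self, x):
        return self.dequant(self.model(self.quant(x)))

fp32_model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
qmodel = QuantizableWrapper(fp32_model).eval()
qmodel.qconfig = get_default_qconfig("fbgemm")  # per-node quantization config
prepared = prepare(qmodel)                      # insert observers
prepared(torch.randn(4, 8))                     # calibration pass
int8_model = convert(prepared)                  # INT8 model, ready for inference right now

# Only the weights are persisted; the qconfig / quantized module structure is not:
torch.save(int8_model.state_dict(), "q_weights.pt")
# A plain fp32 model cannot load this state_dict; the same qconfig +
# prepare + convert steps have to be replayed before load_state_dict().
```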

As a result, if we want to upload a quantized model to the Hugging Face Hub and let users download and evaluate it through the Hugging Face API, we have to provide some code that reads the saved q_weights and q_config, rebuilds the quantized model object, and uses it for evaluation.
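Something along these lines, as a rough sketch only: load_quantized_model, build_fp32_model and the JSON layout of the saved q_config are all my assumptions here, not existing APIs.

```python
import json
import torch
from torch.quantization import get_default_qconfig, prepare, convert

def load_quantized_model(build_fp32_model, weights_path, qconfig_path):
    """Hypothetical helper: rebuild the quantized graph, then load the INT8 weights."""
    with open(qconfig_path) as f:
        q_config = json.load(f)                      # e.g. {"backend": "fbgemm"}
    model = build_fp32_model().eval()                # same architecture as at quantization time
    model.qconfig = get_default_qconfig(q_config["backend"])
    model = convert(prepare(model))                  # recreate the quantized module structure
    model.load_state_dict(torch.load(weights_path))  # now the INT8 state_dict fits
    return model
```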

So it involves some code contributions; I just want to confirm with your expertise that this is the right direction before we put any resources into it.

Possible code changes include:

  1. model definition changes (adding quant/dequant stubs for PyTorch imperative models and post-training static quantization), for example introducing a q_bert class in the huggingface repo, similar in spirit to the wrapper sketch above.
  2. AutoModelForSequenceClassification.from_pretrained('/path/to/quantized/pytorch/model_a') should accept an additional "q_config" parameter so that the returned model is rebuilt as a quantized model (see the usage sketch after this list).
  3. if we want users to be able to use pipeline(), then this function also needs to accept an additional q_config parameter.
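For items 2 and 3, the user-facing flow we have in mind would look roughly like this. The q_config parameter is the proposed addition from this post, not a current transformers API; the paths and file names are placeholders.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Proposed change 2: from_pretrained would take the saved q_config so it can
# rebuild the quantized modules before loading the INT8 weights.
model = AutoModelForSequenceClassification.from_pretrained(
    "/path/to/quantized/pytorch/model_a",
    q_config="/path/to/quantized/pytorch/model_a/q_config.json",  # proposed new parameter
)
tokenizer = AutoTokenizer.from_pretrained("/path/to/quantized/pytorch/model_a")

# Proposed change 3: pipeline() would forward the same q_config, so users can
# evaluate the INT8 model without touching quantization internals.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This quantized model still works offline."))
```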

Appreciate any guidance.