Hi, I am wondering if it's possible to upload a quantized model. From the "model sharing" doc, it looks like we can only upload fine-tuned models based on HF Transformers models.
I learned something from I-BERT, which is a Quantization-Aware Training model. My question is: is it possible to upload an int8 Transformer model produced through Post-Training Quantization rather than Quantization-Aware Training?
The difference between Post-Training Quantization and Quantization-Aware Training is that the former only touches the inference phase (calibrating tensor ranges and quantizing/dequantizing between int8 and fp32 for a performance speedup), while the latter emulates the quantization precision loss by inserting fake_quant ops during the training phase.
Since Post-Training Quantization only involves the inference phase (a qconfig setting in PyTorch, a graph rewrite in TensorFlow), I don't know whether it is possible to upload such a quantized model, and if so, through which API?
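For context, here is a minimal sketch of the PyTorch eager-mode post-training static quantization flow I am referring to. The module and the random calibration data are just toy placeholders, not a real BERT:

```python
# Toy post-training static quantization in PyTorch eager mode.
import torch
import torch.nn as nn

class FloatModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 boundary
        self.linear = nn.Linear(128, 2)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        return self.dequant(self.linear(self.quant(x)))

model = FloatModel().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# Insert observers, run calibration data to collect tensor ranges,
# then convert the modules to their int8 counterparts.
prepared = torch.quantization.prepare(model)
for _ in range(8):                      # toy calibration loop
    prepared(torch.randn(4, 128))
quantized = torch.quantization.convert(prepared)
```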
You can upload any model you want on the hub since it’s git-based. It may not work out of the box with the Transformers library if there is no corresponding class, but you can still share the weights this way.
The problem is that quantized weights alone are not enough for PyTorch INT8 inference. It's a limitation of the PyTorch quantization implementation, which only allows on-the-fly quantization and on-the-fly inference: an intermediate Python object, "q_config", is generated during quantization and used during inference, and this q_config object is not saved by PyTorch. If we would like to use the quantized model later or offline, we need to load both the quantized weights and the q_config of each node, which is not officially supported by PyTorch.
As a consequence, if we want to upload a quantized model to the Hugging Face Hub so that users can download and evaluate it through the Hugging Face API, we have to provide some code that reads the saved quantized weights and q_config, rebuilds a quantized model object, and uses it for evaluation.
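A self-contained toy illustration of the problem (dynamic quantization of a single Linear layer, purely for demonstration): the int8 state_dict cannot simply be loaded back into a fresh fp32 copy of the model, so extra loading code is unavoidable.

```python
import torch
import torch.nn as nn

fp32_model = nn.Sequential(nn.Linear(16, 4)).eval()
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)
torch.save(int8_model.state_dict(), "int8_checkpoint.pt")

# A user who only has the checkpoint and the original fp32 model
# definition cannot restore the quantized model directly:
fresh_model = nn.Sequential(nn.Linear(16, 4)).eval()
try:
    fresh_model.load_state_dict(torch.load("int8_checkpoint.pt"))
except RuntimeError as err:
    print(err)  # missing fp32 keys, unexpected packed int8 params
# The quantization step has to be replayed on fresh_model first,
# which is exactly the extra code discussed here.
```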
So it involves some code contributions; I just want to confirm with your expertise whether this is the right direction before we put any resources into it.
Possible code changes include (see the hypothetical sketch after this list):
Model definition changes (adding quant/dequant stubs to the PyTorch imperative model for post-training static quantization), for example introducing a q_bert class in the huggingface repo.
The model returned from AutoModelForSequenceClassification.from_pretrained('/path/to/quantized/pytorch/model_a') should be able to take an additional parameter "q_config".
If we want users to be able to use pipeline(), then this function also needs to take an additional q_config parameter.
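A purely hypothetical sketch of what that could look like from the user's side; the q_config argument does not exist in transformers today, and the paths and file names are made up:

```python
from transformers import AutoModelForSequenceClassification, pipeline

# Hypothetical: from_pretrained() would replay the quantization described
# in q_config before loading the int8 state_dict.
model = AutoModelForSequenceClassification.from_pretrained(
    "/path/to/quantized/pytorch/model_a",
    q_config="/path/to/quantized/pytorch/model_a/q_config.json",  # hypothetical argument
)

# Hypothetical: pipeline() would forward the same argument so end users
# never touch the quantization machinery directly.
classifier = pipeline(
    "sentiment-analysis",
    model="/path/to/quantized/pytorch/model_a",
    q_config="/path/to/quantized/pytorch/model_a/q_config.json",  # hypothetical argument
)
```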
For static quantization / QAT, things are a bit different. You need to (see the sketch after this list):
Load the model with the proper model config
Apply the same quantization to the model as was previously done
Load the state dict from the checkpoint on that modified model (at this point, every scale and zero_point should be loaded correctly)
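A minimal sketch of these three steps on a toy quantizable module (with a Transformers model you would rebuild it from its config instead); the checkpoint path is a placeholder:

```python
import torch
import torch.nn as nn

# Stand-in for the real model definition / config.
class FloatModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.linear = nn.Linear(128, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.linear(self.quant(x)))

# 1. Rebuild the float model.
model = FloatModel().eval()

# 2. Re-apply exactly the same quantization recipe as at export time so the
#    module structure matches the checkpoint.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)
quantized = torch.quantization.convert(prepared)

# 3. The int8 state_dict (scales, zero_points, packed weights) now lines up
#    with the quantized modules and can be loaded.
quantized.load_state_dict(torch.load("quantized_checkpoint.pt"))
```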
Because we are saving the state_dict and not the graph itself, it is impossible to "guess" where the observers / fake quantizations / quantize nodes were located, so the second step is somewhat inevitable (although I am working on graph mode quantization, which might solve that). For quantized models (after torch.quantization.convert), I would recommend tracing the model with TorchScript, at least that's what I have done, as it provides everything needed to run inference, which is usually the goal once a model has been quantized.
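A short sketch of that TorchScript route, continuing from the quantized toy model above (the example input shape is a placeholder):

```python
import torch

# `quantized` is the converted int8 model from the sketch above.
example_input = torch.randn(1, 128)
traced = torch.jit.trace(quantized, example_input)
torch.jit.save(traced, "quantized_model.ts")

# Later / elsewhere: no model definition or qconfig needed, just the file.
loaded = torch.jit.load("quantized_model.ts")
outputs = loaded(example_input)
```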
Of course, this isn't as simple as being able to load a quantized model with from_pretrained, so I'll let @sgugger comment on whether this type of feature would make sense to include in transformers itself.