Hi, I am wondering if it's possible to upload a quantized model. From the "model sharing" doc, it looks like we can only upload fine-tuned models based on HF Transformers models.
I learned something from I-BERT, which is a Quantization-Aware Training model. My question is: is it possible to upload an int8 Transformer model produced through Post-Training Quantization rather than Quantization-Aware Training?
The difference between Post-Training Quantization and Quantization-Aware Training is that the former only touches the inference phase (calibrating tensor ranges and quantizing/dequantizing between int8 and fp32 for a performance speedup), while the latter emulates the quantization precision loss by inserting fake_quant ops during the training phase.
Since Post-Training Quantization only involves the inference phase (a qconfig setting in PyTorch, a graph rewrite in TensorFlow), I don't know whether it is possible to upload such a quantized model, and if so, through which API?
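For context, here is a minimal sketch of the PyTorch eager-mode post-training static quantization flow I am referring to. The module and the random calibration data are just toy placeholders, not a real BERT:

```python
# Toy post-training static quantization in PyTorch eager mode.
import torch
import torch.nn as nn

class FloatModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 boundary
        self.linear = nn.Linear(128, 2)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        return self.dequant(self.linear(self.quant(x)))

model = FloatModel().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# Insert observers, run calibration data to collect tensor ranges,
# then convert the modules to their int8 counterparts.
prepared = torch.quantization.prepare(model)
for _ in range(8):                      # toy calibration loop
    prepared(torch.randn(4, 128))
quantized = torch.quantization.convert(prepared)
```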
You can upload any model you want on the hub since it’s git-based. It may not work out of the box with the Transformers library if there is no corresponding class, but you can still share the weights this way.
The problem is that quantized weights alone are not enough for PyTorch INT8 inference. It's a limitation of the PyTorch quantization implementation, which only allows on-the-fly quantization and on-the-fly inference: an intermediate Python object, "q_config", is generated during quantization and used during inference, and this q_config object is not saved by PyTorch. If we would like to use the quantized model later or offline, we need to load both the quantized weights and the q_config of each node, which is not officially supported by PyTorch.
As a consequence, if we want to upload a quantized model to the Hugging Face Hub so that users can download and evaluate it through the Hugging Face API, we have to provide some code that reads the saved quantized weights and q_config, rebuilds a quantized model object, and uses it for evaluation.
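A self-contained toy illustration of the problem (dynamic quantization of a single Linear layer, purely for demonstration): the int8 state_dict cannot simply be loaded back into a fresh fp32 copy of the model, so extra loading code is unavoidable.

```python
import torch
import torch.nn as nn

fp32_model = nn.Sequential(nn.Linear(16, 4)).eval()
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)
torch.save(int8_model.state_dict(), "int8_checkpoint.pt")

# A user who only has the checkpoint and the original fp32 model
# definition cannot restore the quantized model directly:
fresh_model = nn.Sequential(nn.Linear(16, 4)).eval()
try:
    fresh_model.load_state_dict(torch.load("int8_checkpoint.pt"))
except RuntimeError as err:
    print(err)  # missing fp32 keys, unexpected packed int8 params
# The quantization step has to be replayed on fresh_model first,
# which is exactly the extra code discussed here.
```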
So it involves some code contributions; I just want to confirm with your expertise whether this is the right direction before we put any resources into it.
Possible code changes include (see the hypothetical sketch after this list):
Model definition changes (adding quant/dequant stubs to the PyTorch imperative model for post-training static quantization), for example introducing a q_bert class in the huggingface repo.
The model returned from AutoModelForSequenceClassification.from_pretrained('/path/to/quantized/pytorch/model_a') should be able to take an additional parameter "q_config".
If we want users to be able to use pipeline(), then this function also needs to take an additional q_config parameter.
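A purely hypothetical sketch of what that could look like from the user's side; the q_config argument does not exist in transformers today, and the paths and file names are made up:

```python
from transformers import AutoModelForSequenceClassification, pipeline

# Hypothetical: from_pretrained() would replay the quantization described
# in q_config before loading the int8 state_dict.
model = AutoModelForSequenceClassification.from_pretrained(
    "/path/to/quantized/pytorch/model_a",
    q_config="/path/to/quantized/pytorch/model_a/q_config.json",  # hypothetical argument
)

# Hypothetical: pipeline() would forward the same argument so end users
# never touch the quantization machinery directly.
classifier = pipeline(
    "sentiment-analysis",
    model="/path/to/quantized/pytorch/model_a",
    q_config="/path/to/quantized/pytorch/model_a/q_config.json",  # hypothetical argument
)
```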
For static quantization / QAT, things are a bit different. You need to (see the sketch after this list):
Load the model with the proper model config
Apply the same quantization to the model as was previously done
Load the state dict from the checkpoint on that modified model (at this point, every scale and zero_point should be loaded correctly)
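A minimal sketch of these three steps on a toy quantizable module (with a Transformers model you would rebuild it from its config instead); the checkpoint path is a placeholder:

```python
import torch
import torch.nn as nn

# Stand-in for the real model definition / config.
class FloatModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.linear = nn.Linear(128, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.linear(self.quant(x)))

# 1. Rebuild the float model.
model = FloatModel().eval()

# 2. Re-apply exactly the same quantization recipe as at export time so the
#    module structure matches the checkpoint.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)
quantized = torch.quantization.convert(prepared)

# 3. The int8 state_dict (scales, zero_points, packed weights) now lines up
#    with the quantized modules and can be loaded.
quantized.load_state_dict(torch.load("quantized_checkpoint.pt"))
```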
Because we are saving the state_dict and not the graph itself, it is impossible to "guess" where the observers / fake quantizations / quantize nodes were located, so the second step is somewhat inevitable (although I am working on graph mode quantization, which might solve that). For quantized models (after torch.quantization.convert), I would recommend tracing the model with TorchScript, at least that's what I have done, as it provides everything needed to run inference, which is usually the goal once a model has been quantized.
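A short sketch of that TorchScript route, continuing from the quantized toy model above (the example input shape is a placeholder):

```python
import torch

# `quantized` is the converted int8 model from the sketch above.
example_input = torch.randn(1, 128)
traced = torch.jit.trace(quantized, example_input)
torch.jit.save(traced, "quantized_model.ts")

# Later / elsewhere: no model definition or qconfig needed, just the file.
loaded = torch.jit.load("quantized_model.ts")
outputs = loaded(example_input)
```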
Of course, this isn't as simple as being able to load a quantized model with from_pretrained, so I'll let @sgugger comment on whether this type of feature would make sense to include in transformers itself.