hey @ftian i had a chat with michael benayoun who ran into a similar issue while developing the quantization modules for the nn_pruning
library: https://github.com/huggingface/nn_pruning/tree/main/nn_pruning/modules
as general advice, he recommends the following:
For static quantization / QAT, things are a bit different; you need to:
- Load the model with the proper model config
- Apply the same quantization to the model as it was previously done
- Load the state dict from the checkpoint on that modified model (at this point, every scale and zero_point should be loaded correctly)
Because we are saving the state_dict and not the graph itself, it is impossible to “guess” where the observers / fake quantization / quantize nodes were located, so the second step is somewhat unavoidable (although I am working on graph mode quantization, which might solve that). For quantized models (after torch.quantization.convert), I would recommend tracing the model with torchscript, at least that’s what I have done, as it provides everything needed to run inference, which is usually the goal when a model was quantized.
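to make those steps concrete, here's a rough sketch of what the reload flow could look like in eager mode. the paths, the choice of quantizing only the nn.Linear layers, and the fbgemm backend are assumptions on my side; the one thing that matters is that the recipe matches whatever was applied before the checkpoint was saved:

```python
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification

# hypothetical paths / task class -- adapt to your actual checkpoint
config = AutoConfig.from_pretrained("path/to/quantized-checkpoint")
model = AutoModelForSequenceClassification.from_config(config)
model.eval()

# step 2: re-apply exactly the same quantization recipe that produced the checkpoint.
# here we assume eager-mode static quantization of the nn.Linear layers with the
# fbgemm backend; use get_default_qat_qconfig / prepare_qat instead for a QAT run.
qconfig = torch.quantization.get_default_qconfig("fbgemm")
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        module.qconfig = qconfig
torch.quantization.prepare(model, inplace=True)   # inserts observers
torch.quantization.convert(model, inplace=True)   # swaps in quantized modules
# (skip the convert call if the checkpoint was saved before torch.quantization.convert,
#  e.g. mid-QAT with the fake-quant modules still in place)

# step 3: the module hierarchy now matches the saved one, so every scale and
# zero_point lands in the right place
state_dict = torch.load("path/to/quantized-checkpoint/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)
```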
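and for the torchscript route he mentions on a converted model, something along these lines (this continues from the sketch above; the dummy inputs are placeholders, you'd normally trace with real tokenizer outputs of a representative shape):

```python
import torch

# `model` / `config` come from the sketch above (already converted to quantized modules)
model.config.return_dict = False  # return plain tuples so torch.jit.trace can handle the outputs

dummy_input_ids = torch.randint(0, config.vocab_size, (1, 128))
dummy_attention_mask = torch.ones(1, 128, dtype=torch.long)

traced = torch.jit.trace(model, (dummy_input_ids, dummy_attention_mask))
torch.jit.save(traced, "quantized_model.pt")

# later, inference only needs the torchscript archive, not the modeling code or the recipe
loaded = torch.jit.load("quantized_model.pt")
with torch.no_grad():
    logits = loaded(dummy_input_ids, dummy_attention_mask)[0]
```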
of course this isn’t as simple as being able to load a quantized model with from_pretrained, so i’ll let @sgugger comment on whether this type of feature would make sense to include in transformers itself