Optimum Pruning and Quantization Current Limitation

We are checking out the Huggingface Optimum. There are some issues that we would like to clarify:

  • Pruning does not always speed up the model, and it may even increase the model's storage size, which is unexpected.

  • Dynamic quantization works only on the CPU (running it on the GPU raises an error about a conflict between CPU and GPU devices).

Could someone familiar with this area explain this behavior? We have high hopes for Hugging Face Optimum as a model compression tool.

If more details are necessary, I would be glad to provide them.

Hi @samuelmat19

Pruning does not always speed up the model, and it may even increase the model's storage size, which is unexpected.

Currently, the supported pruning method is magnitude-based unstructured pruning, which replaces the values of the pruned weights with 0. Because the zeroed weights are still stored in the same dense tensors, the model size should not change, and no speed-up should be expected unless sparse storage formats or sparsity-aware kernels are used.
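The effect can be illustrated with a small NumPy sketch (hypothetical helper name, not Optimum's API): magnitude-based unstructured pruning zeroes the smallest-magnitude weights, but the dense tensor keeps its shape and byte size, so storage does not shrink.

```python
import numpy as np

def magnitude_prune(weights, amount=0.5):
    """Zero the smallest-magnitude entries (unstructured magnitude pruning)."""
    k = int(amount * weights.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value acts as the pruning threshold.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
pruned = magnitude_prune(w, amount=0.5)

# At least half the entries are now exactly 0 ...
num_zeros = int((pruned == 0).sum())
# ... yet the dense array occupies exactly as many bytes as before.
same_size = pruned.nbytes == w.nbytes
```

Standard dense matrix multiplication also does not skip zeros, which is why inference time is unchanged as well.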

Dynamic quantization works only on the CPU (running it on the GPU raises an error about a conflict between CPU and GPU devices).

Concerning dynamic quantization: unfortunately, at the moment PyTorch does not provide quantized operator implementations on CUDA; only CPU backends are available.

Hi @echarlaix

That is clear to me. I also opened an issue on Github, in which it was clarified.

Would it make sense to add some writings in the Optimum’s documentation, which emphasizes the fact that this pruning does not provide speed up or reduction in model size? I could imagine there are some people that are trying to use the pruning for speed up, and would be perplexed when it provides no speed up. This happened to me, and spent some time to figure out what I did wrong.

Yes and as stated in the issue you are referring to, we are planning to add additional pruning methods in the future, bear in mind that this is still a work in progress. We will also make our documentation more detailed in order to make things more understandable for the user.

@echarlaix that is lovely and I appreciate the work being put here in model optimization. Also, if there is something I can contribute in the library, do let me know, I would be glad to help.

Cheers.