Should pruning shrink the model? Adjusting sparsity didn't change inference time

I’ve tried a couple configurations of the WeightPruningConfig to see if the model size shrinks or the inference time is reduced. For all the things I’ve tried, the exported size of the model remained the same. I’ve tested only on gpt2 for now, as larger models throw exceptions for running out of memory.

Q1. Is the model size with varying pruning types and sparsities expected to change? I started thinking maybe it won’t shrink because it’s replacing a weight with a 0 (right?), not getting rid of the connection. Is this “0” represented as a float32? Maybe that would explain why it isn’t smaller.

Q2. I tried changing the sparsity from 0.9 to 0.1, and the size and inference time remained the same. The code is below: I ran the original GPT2 and compared it to the pruned models. The pruned models produced a 100-token inference in a shorter time (maybe half the time), but there was very little difference in time between 0.9 and 0.1 sparsity. Does this mean it didn’t prune much because it hit the target easily? I didn’t see any output saying what the current sparsity was as it was pruning. I also tried changing the type to channelx1 (not really knowing what that type means) and didn’t see much change.

Q3. I see the default configuration of pruning_op_types is “Conv” and “Linear”. How can I check my model to ensure these operations are in there? (I would assume they are, but I’m new to neural networks.) Also, how can I see that they have been pruned? How can I see what the op type of a given layer is, and maybe see the % of weights that have been set to 0?

Q4. If combining pruning with quantization, which operation should be done first? I assume pruning, since you can then use quantization-aware training?

Q5. What is the difference between saving the model with:
trainer.save_model(outpath + "1")
model.save_pretrained(outpath + "2")
I was unable to load the model in the second case using INCModelForCausalLM.


Full code:

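from neural_compressor import WeightPruningConfig
from optimum.intel import INCTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

async def prune_for_forum():
    # Same pipeline twice, once per target sparsity
    await run(0.9, "outpath90")
    await run(0.1, "outpath10")

async def run(sparsity, outpath):
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

    # dataset_manager is my own helper module (not shown here)
    dataset = await dataset_manager.get_processed_home_depot_dataset_for_training(tokenizer, 1024)

    pruning_config = WeightPruningConfig(target_sparsity=sparsity)
    training_args = TrainingArguments(outpath + "0", use_cpu=False, num_train_epochs=0.5, do_train=False, do_eval=False)

    # The end of this call was cut off in the post; passing pruning_config here is assumed
    trainer = INCTrainer(model=model,
                         args=training_args, train_dataset=dataset["train"],
                         # eval_dataset=dataset["validation"],
                         pruning_config=pruning_config)

    train_result = trainer.train()
    res = trainer.save_model(outpath + "1")
    model.save_pretrained(outpath + "2")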

Running them:

    model = intel.INCModelForCausalLM.from_pretrained("outpath901")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cpu")
    res = await perf.print_duration("Running pipeline",
                                    "Here is a recipe for vegan banana bread:\n",

Hi @rmiller3,

Pruning is applied to both the linear and the convolutional layers. It’s important to note that the pruning sparsity defined in the configuration is applied to these layers only, so it will not be the resulting global model sparsity. And yes, with our current Intel Neural Compressor integration, the model size will not be impacted, as the values of the “pruned” elements are replaced with 0. You can also combine pruning and quantization during your fine-tuning step; you only need to specify the corresponding configuration describing the parameters needed for each step.
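
To illustrate why zeroed weights do not make a dense checkpoint smaller, here is a minimal standalone sketch. It does not use Intel Neural Compressor; the `prune_smallest` helper is hypothetical and uses plain magnitude pruning as a stand-in for whatever criterion the real pruner applies. A 0.0 stored as float32 still takes 4 bytes, so the raw size is identical, although the pruned buffer does compress better.

```python
import array
import random
import zlib


def prune_smallest(weights, sparsity):
    """Return a float32 copy with the smallest-magnitude fraction set to 0.0.

    Hypothetical helper: simple magnitude pruning, used only for illustration.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = array.array("f", weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


random.seed(0)
# 10,000 float32 "weights" standing in for one dense layer
dense = array.array("f", (random.gauss(0.0, 1.0) for _ in range(10_000)))
pruned = prune_smallest(dense, 0.9)

dense_bytes = dense.tobytes()
pruned_bytes = pruned.tobytes()

# Raw dense storage: every element is still stored, zero or not
print("raw bytes:       ", len(dense_bytes), len(pruned_bytes))
# Generic compression benefits from the long runs of zero bytes
print("zlib-compressed: ", len(zlib.compress(dense_bytes)), len(zlib.compress(pruned_bytes)))
```

The raw byte counts match because the layout is still dense. A smaller file on disk requires either a sparse storage format, compression, or structurally removing whole blocks or channels.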

To reduce your model size and speed up inference, you can apply block pruning with NNCF, which you can also combine with quantization-aware training; you can find an example in the documentation. After this step, you’ll need to load your model with the corresponding OVModelForXxx class to perform inference with OpenVINO runtime; here is the list of supported devices.

To save the resulting model, you’ll need to use the .save_model() method from your trainer; this will save your model along with its corresponding configuration.

Thanks…a few followup questions.

On the neural compressor page here, it says: “Prune parameters that have minimal effect on accuracy to reduce the size of a model. Configure pruning patterns, criteria, and schedule.” Is that technically inaccurate, since the size doesn’t change unless you quantize?

When should one use the OVTrainer/OpenVINO libraries vs. the INCTrainer?

Do you have recommendations for what kind of datasets to use with the trainers when working with text-generation models? One thing that confused me was how the evaluation datasets are used in pruning: since there isn’t really a pass/fail criterion, I ended up having to leave them out.

Is there a way to view the sparsity of a given layer? I tried different sparsities with the existing code, and while pruning did speed up the runtime, changing the sparsity from 0.1 to 0.9 didn’t change anything. Is that because the model had already met its sparsity goals? Would there be a way to check?
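
One way to check this is to count exact zeros per layer. The sketch below is a library-independent illustration, not part of Optimum Intel: `sparsity_report` is a hypothetical helper, and the toy `weights` dict stands in for a real model's flattened weights. It also shows why a 90% target on the pruned layers does not give 90% global sparsity when other layers are left untouched.

```python
def sparsity_report(named_weights):
    """Map each layer name to its fraction of exact zeros.

    `named_weights` maps a layer name to a flat sequence of numbers.
    A "__global__" entry gives the zero fraction over all layers combined.
    """
    report = {}
    total = 0
    zeros = 0
    for name, values in named_weights.items():
        values = list(values)
        n_zero = sum(1 for v in values if v == 0)
        report[name] = n_zero / len(values) if values else 0.0
        total += len(values)
        zeros += n_zero
    report["__global__"] = zeros / total if total else 0.0
    return report


# Toy stand-in for a model: one untouched layer, one layer pruned to 90%
weights = {
    "embedding": [0.3, -0.1, 0.7, 0.2, 0.5, -0.4, 0.9, 0.6, -0.8, 0.1],
    "mlp.linear": [0.0] * 9 + [1.2],
}

report = sparsity_report(weights)
for name, sparsity in report.items():
    print(f"{name}: {sparsity:.0%}")
```

With PyTorch, the same per-layer number can be computed as `(param == 0).float().mean().item()` while looping over `model.named_parameters()`, and `type(module).__name__` while looping over `model.named_modules()` shows each layer's op type. For GPT-2 in particular it is worth looking at those type names, since most of its projection layers in transformers are implemented with a `Conv1D` class rather than `nn.Linear`.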