How to prune a Transformer-based model?

Hi, I am trying to reduce the memory footprint and speed up inference of my own fine-tuned transformer. I came across the pruning tutorial on the Hugging Face site and am using the snippet below. The trainer.train() call is missing from it, so I added it. It ran without error, but memory usage did not decrease: model.get_memory_footprint() reported 503695916 bytes both before and after pruning. Inference speed did not change either. I also tried different pruning configurations (global pruning, different pruning types, different target sparsities), but none of them helped. Can someone help me?

from import INCTrainer
from neural_compressor import WeightPruningConfig
from transformers import TrainingArguments, Trainer
from import default_data_collator

pruning_config = WeightPruningConfig(

trainer = INCTrainer(
args=TrainingArguments(save_dir, max_steps=500, num_train_epochs=1.0, do_train=True, do_eval=True, metric_for_best_model="f1", greater_is_better=True),
train_result = trainer.train() # <-- Added by me
trainer.save_model(save_dir) # <-- Added by me
optimized_model = AutoModelForSequenceClassification.from_pretrained(save_dir)

memory_footprint = optimized_model.get_memory_footprint()
print(f"Model memory footprint: {memory_footprint} bytes")

Expected behavior
As per the tutorial, the model should be pruned, so the original model and the pruned model should have different sizes. Instead, both report the same memory footprint.

@ArthurZucker @younesbelkada @amyeroberts @sgugger @pacman100 @stas00 @muellerzr @stevhliu @MKhalusova

Hi @Jyotiyadav,

You are currently applying magnitude pruning to your model, which is an unstructured pruning method. The values of the "pruned" weights are replaced by 0, but the weight tensors keep their shape and are still stored densely, so the model size does not change. We are currently working with the Intel Neural Compressor team to identify the rows/columns that can be completely removed from the model (when the sparsity is very high in the context of unstructured pruning).
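To make this concrete, here is a minimal sketch of unstructured magnitude pruning on a single dense weight matrix. It uses plain NumPy rather than Neural Compressor, and the helper name magnitude_prune, the 768x768 shape and the 0.9 sparsity are assumptions chosen for illustration. It only shows why a dense tensor takes the same number of bytes before and after its smallest values are zeroed.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude entries set to 0.

    Hypothetical helper for illustration only, not a Neural Compressor API.
    """
    k = int(round(sparsity * weights.size))  # number of entries to zero out
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned


rng = np.random.default_rng(0)
dense = rng.standard_normal((768, 768)).astype(np.float32)  # assumed layer shape
pruned = magnitude_prune(dense, sparsity=0.9)

achieved_sparsity = float(np.mean(pruned == 0.0))

# Same shape and dtype, so the dense storage cost is identical.
print("bytes before:", dense.nbytes, "bytes after:", pruned.nbytes)
print("fraction of zeros:", achieved_sparsity)
```

So a footprint check alone cannot tell you whether pruning happened. Counting the fraction of zero entries in the weights, as in the last line, is a more direct check.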
Since optimum-intel v1.7, structured pruning is available through NNCF, and it can be combined with quantization-aware training and distillation. A detailed example is in the documentation. For more information about the pruning methodology, see the paper it is based on.
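For contrast, here is a rough sketch of what structured pruning does to a tensor: whole rows are removed, so the array itself gets smaller. This is again plain NumPy and not the NNCF API. The helper drop_weakest_rows, the L2-norm ranking criterion, the 3072x768 shape and the 0.5 keep ratio are all assumptions for illustration.

```python
import numpy as np


def drop_weakest_rows(weight: np.ndarray, bias: np.ndarray, keep_ratio: float):
    """Keep only the output rows with the largest L2 norm.

    Hypothetical helper for illustration only. Real structured pruning
    methods choose what to remove with more elaborate criteria.
    """
    n_keep = max(1, int(round(keep_ratio * weight.shape[0])))
    row_norms = np.linalg.norm(weight, axis=1)
    # Indices of the rows to keep, in their original order.
    keep = np.sort(np.argsort(row_norms)[-n_keep:])
    return weight[keep], bias[keep], keep


rng = np.random.default_rng(0)
weight = rng.standard_normal((3072, 768)).astype(np.float32)  # assumed layer shape
bias = np.zeros(3072, dtype=np.float32)

small_w, small_b, kept = drop_weakest_rows(weight, bias, keep_ratio=0.5)

# The pruned layer has fewer rows, so it needs fewer bytes.
print("bytes before:", weight.nbytes, "bytes after:", small_w.nbytes)
```

In a real network, the layer that consumes this output would also need its input columns sliced with the same kept indices, otherwise the shapes no longer match.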