Hugging Face Reads - 01/2021 - Sparsity and Pruning

:hugs: Hugging Face Reads :hugs:

January 2021 - Sparsity and Pruning

By Victor Sanh, François Lagunas, and Yacine Jernite

Introduction to the Hugging Face Reads (HFR) series

New year, new Hugging Face reading group :hugs:! We are launching the Hugging Face Reads (HFR) series: each month, we will choose a topic to focus on, reading a set of four papers recently published on the subject. We will then write a short blog post summarizing their findings and the common trends between them, questions we had for follow-up work after reading them, and how recent advances in the area have affected our work at HF.

The first topic for January 2021 is Sparsity and Pruning, and we are planning to address Long-Range Attention in Transformers next month. Enjoy, and come join the conversation here!


While large-scale pre-trained language models help solve an ever-growing set of natural language processing tasks, the progressive increase in their sizes raises concerns about their wide-scale applicability in practical settings, especially on devices with limited memory and computing power.

Sparse neural network models which only use a fraction of the large parameter sets of their dense equivalents offer a promising avenue to reduce these computational demands. Recent works have proposed various methods to achieve impressive levels of sparsity, whether by gradually choosing which parameters to retain during training or by “pruning” the parameter set after the fact. This post presents an overview of four papers proposing or analyzing such methods. We review the following works: the (Chen et al., NeurIPS 2020) paper investigating the applicability of the Lottery Ticket Hypothesis to BERT-style models, the (Frankle et al., 2020) analysis of currently available methods to find sparsity patterns at initialization before doing any training, the (Li et al., 2020) work on the computational and performance trade-offs between training a large model to prune later vs. training smaller models right away, and the (Hooker et al., 2020) study of the biases introduced by current methods used to compress models (including pruning).

Paper summaries

For each paper, we identify some of the claims and contributions, as well as some follow-up questions.

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

By Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin

The Lottery Ticket Hypothesis (LTH) was initially developed and tested on computer vision systems. It states that given an initialization of a model, it is possible to find a subset of sufficient parameters during training: i.e., such that training only those parameters while setting the others to zero allows the model to reach the same performance as training the full model. Unfortunately, this subset can only be found after some amount of computation, and the method requires several iterations of re-training (either from scratch or from an earlier checkpoint, a method known as rewinding) and pruning for full effect. However, the approach can still end up improving training time and outputs a ready-to-use sparse model.

This paper sets out to validate the LTH in NLP (and in particular in BERT-style models). Specifically, it asks whether sparse subnetworks of a model pre-trained with Masked Language Modeling (MLM) are sufficient to solve down-stream tasks. The answer is broadly positive.


  • Using a pre-trained initialization, BERT contains sparse subnetworks at non-trivial sparsities that can be fine-tuned in isolation to full performance on a range of downstream tasks.
  • As opposed to previous work, these subnetworks are found at pre-trained initialization and not at random initialization (which was the case with the original LTH work). Rewinding does not significantly improve accuracy on downstream tasks.
  • There are universal subnetworks that transfer to all studied downstream tasks. By further fine-tuning on the same task that was used for pre-training (Masked Language Modeling), the method finds a 70% sparse sub-network that can yield good results on all downstream applications.

Follow-up questions

  • In practice, the computational cost of fine-tuning is already much less than that of pre-training. How would “fine-pruning” (pruning while fine-tuning with methods such as movement pruning) a model on a downstream task compare to using the LTH-sparse model obtained with MLM (or with the downstream task)?
  • The lack of impact of rewinding is in stark contrast with previous work on networks initialized from scratch and bears closer examination. For example, does this finding hold across fine-tuning learning rates? How much does the value of the selected parameters change over time?

Pruning Neural Networks at Initialization: Why are We Missing the Mark?

By Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin

This paper analyzes the performance of several methods to prune networks at initialization, so even before training starts, to save on training time as the network is smaller or sparse (SNIP, GraSP, SynFlow, Magnitude pruning). The methods are allowed to sample the dataset to perform the pruning: this sampling can be considered negligible compared to the computation required for training. They compare the methods to two “upper bounds” representing the performance we can hope to achieve when given access to information that is available after training: Magnitude Pruning and Lottery Ticket Rewinding.


  • All proposed methods are better than random pruning, but they are not sensitive to the individual selection weights, only to pruning proportions on each layer. Even worse, selecting the weights with the lowest instead of the highest value of the utility criteria improves performance on some methods (GraSP), which appears to invalidate some of the original works’ claims.
  • The methods are far from competitive with post-training approaches. Moreover, none of the methods is SOTA in all settings: some methods are better at some sparsity levels than others, but this depends on sparsity.
  • The methods yield better results if they are applied after a few training steps rather than right away at initialization, but they need a significant amount of training to approach the proposed “upper bounds”.

Follow-up questions

  • The problem of finding a “good” subnetwork right at initialization seems somewhat under-defined and possibly overly difficult: which task or set of tasks is used to measure success? Is it even possible to find an ideal sub-networks that works on any possible task a priori? Consequently, it is hard to tell whether the mixed results stem from flaws in the methods or from the task’s inherent difficulty. More insights here would be particularly enlightening.
  • The authors note that the studied methods prune “layers, not weights”, which may explain the surprising results they obtain by inverting the weight selection. In that case, would a dense model with adaptive layer sizes following the same patterns work as well?
  • An interesting follow-up direction could be something along the lines of “pruning as soon as possible”. Recent “Bertology” work has shown that pre-trained models learn different levels of skill in sequence: we are particularly eager to see follow up work that explores the relationship between the emergence of these skills and the optimal pruning strategy.

Train Large then Compress, Rethinking Model Size for Efficient Training and Inference of Transformers

By Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez

This paper explores the landscape of the computational tradeoffs between transformer models’ sizes and the required numbers of hyper-parameter settings and training steps to achieve a good performance. It finds larger sizes can allow for fewer hyper-parameter settings and training steps and offers some practical advice about choosing a larger initial number of parameters that can later be pruned to, counter-intuitively, reduce the overall computational cost of training a mode when compared to just training a smaller model from scratch.


  • Large models are faster to train: they reach a given precision faster when measuring optimizing steps/wall clock time/ flops, even when they are stopped before convergence. Absolute size is more important than depth or width alone, but depth can be more important than width in some cases. The faster convergence usually makes up for the faster execution of smaller models.
  • Large models can be compressed to smaller networks. Training large networks might speed up training but would lead to problems at inference time, as their resource cost is much higher. This work finds that pruning them to networks that end up containing fewer parameters than the original smaller alternatives still yields higher performance. They can be quantized too with less quantization error.
  • Batch size has an influence on training speed. In practice, this means that gradient accumulation should be used for larger models.

Follow-up questions

  • The results are impressive, but it can still be difficult to get some intuition for why the larger models converge to a better state faster and are easier to prune. The authors mention previous work hinting that deeper networks “promote movement along directions already taken” as a possible explanation, but we are definitely looking forward to reading further analysis.
  • The connection to Lottery Ticket Hypothesis is mentioned only in passing. Further work exploring whether the sub-networks selected by the two approaches are similar in any fashion (such as by considering the Jaccard distance between the sets).

Characterizing Bias in Compressed Models

By Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, Emily Denton

This paper sheds light on the impact of pruning on neural models for vision and shows that reported top-line accuracies often hide the disproportionate negative impact on certain classes of inputs. The paper connects this phenomenon to bias and fairness considerations.


  • While the overall error is largely unchanged when a model is compressed (by pruning and quantization), there is a set of data that bears a disproportionately high portion of the error, with their accuracy falling by up to 50% while the overall performance only decreases by 1%, regardless of what the original accuracy was on the group.
  • These examples (or at least some of them) can be consistently identified by comparing the predictions from a population of compressed models with the predictions from a population of non-compressed models on the same inputs: the examples where the predictions distributions diverge are called Compressed Identified Examples (CIE).
    CIE often correspond to low-frequency patterns in the inputs. Compression cannibalizes performance on low-frequency patterns in order to optimize the performance on higher-frequency patterns and preserve the overall accuracy.
  • Compression thus amplifies biases of models (amplifying certain errors on certain types of inputs). The authors suggest using CIE as an auditing tool for compressed models: surfacing a tractable subset of the data for further inspection by domain experts to assess this issue.

Follow-up questions

  • This paper studies are pruning and quantization techniques that are run after training. One question that remains open is whether the models are facing an issue of modeling capacity (i.e., less-biased predictions require more representation power) or whether it is tied to the training procedure. Analyzing methods that reduce model size in the course of training or approaches such as gradual pruning would shed some light on this question.
  • Would up-weighting the CIE examples in training lead to models that are more robust to compression? Or would we expect to find different CIE groups?
  • The authors suggest using CIE as a diagnostic tool. What can be done with the diagnostic? Are there other calls to action from these insights? For instance, how could we change existing benchmarks on compression to include robustness metrics (i.e., adding another component to the tradeoff size vs. accuracy on CIE groups)?

Reading Group Discussion

The quantitative results obtained on many of the common benchmark tasks by pruning are impressive. At the same time, they also remind us how much we still have to learn about the training dynamics of neural networks. Common wisdom states that “overparameterization helps with optimization”, but we have little theory available to help us understand the phenomenon further, especially in the deep attention-based models that perform so well in NLP.

Each of the four papers above offers a different view of this question of modeling capacity vs. optimization vs. generalization.

The Lottery Ticket Hypothesis relies on the quality of the initial state of the parameters at least as much as on the evolution of the weight values during optimization. As such, the main purpose of increasing the number of parameters would be to exponentially increase the chances of hitting a good sub-network at initialization.

Other approaches focus more on how and whether the gradient flowing through the possibly redundant parameters help optimize the value of the ones we want to keep in the final pruned network: whether they try to evaluate that impact a priori as in the SynFlow algorithm or are content to simply keep them around for optimization based on their empirically proven efficiency and to prune them at the end of the training.

All of the works outlined above, however, assume that the neural networks are indeed over-parameterized and that they can be pruned without changing their qualitative behavior. The CIE work questions that assumption and finds that pruning does change the behavior of the model in non-trivial ways. This assessment also agrees with some experiments Victor Sanh has run on the task for natural language inference, gradually pruning a model trained on multiNLI and testing it on the HANS dataset. As the sparsity increases, the generalization as measured by the accuracy on the HANS test set decreases and gradually drops to 0 while the performance on the multiNLI test set stays mostly constant. Another experiment along those lines would be to see how much factual knowledge pre-trained language models lose as they are pruned (for example by monitoring closed-book QA accuracy for a model like T5).

The question remains whether this loss of generalization and increased bias is a result of the model losing “expressive capacity” as its number of parameters decreases or whether the fault lies in the compression strategies that aren’t quite flexible enough, but the results certainly suggest that a large number of parameters serves as more than a crutch for optimization.

Another question that is somewhat orthogonal to the one above is that of when to optimally prune weights from the model. Pruning early saves computation, but does not benefit from any signal from the target task. Pruning after training can take advantage of additional information but does not save any computation at training time or allow the parameters to adapt to the new sparsity pattern. Gradually pruning during training seems to provide the best of both worlds, but introduces a new set of hyper-parameters which may make optimization more costly. One should also keep in mind that actual computational gains will depend on the capabilities of current hardware and their ability to take full advantage of shifting sparsity patterns.

We’re very much looking forward to the progress on all of these questions that 2021 is sure to bring!

@HuggingFace :hugs:: Sparsity and Pruning

We first started investigating ways to make existing models more computationally efficient with DistilBERT, a method which was used to train one of our most popular models. The follow-up on sequence-to-sequence models yielded DistilBart, which also reaches similar performances to their larger counterparts at a lesser cost. Recently, we have also investigated approaches which focus on sparsity more specifically.

Movement Pruning

Most of the works referenced above use magnitude pruning, a widely used strategy for pruning which thresholds weight values and simply sets the smallest ones to zero. In our work on Movement Pruning led by Victor Sanh, we argue that this approach is less effective in the context of transfer learning and highlight the importance of considering the changes of weights during fine-tuning as opposed to relying (mostly) on the pre-trained values. Code & hyper-parameters are available here.

Block Movement Pruning

The main drawback of unstructured pruning from a practical point of view is that current hardware can make it quite difficult to take full advantage of the sparsity pattern to accelerate the computation of the network. A compromise that can help alleviate this issue is the use of “semi-structured” sparsity patterns. By selecting blocks (typically 32x32) of weights instead of single weights, and running the same kind of optimization methods.

Accelerating block sparse linear algebra is easier, and the pytorch_block_sparse library developed at Hugging Face is our attempt to show that. We are pretty confident more and more solutions for block-sparsity computation will emerge, and we will be working with major actors to enable it. We are already providing some sample networks that take advantage of block sparsity, so you can judge by yourself!

Finally, we also work to combine block sparsity with other accelerated sparsity patterns such as NVidia Ampere, to further decrease the memory, the compute and the energy used by the neural networks that will be everywhere in the near future.


Hi @VictorSanh I noticed that your implementation of movement pruning involves some masked versions of BERT like MaskedBertForSequenceClassification. Do you know whether these classes will become part of the main library at some point in the future?

Hi! I just wanted to add a wrap-up about our article in this context.

Sparsifying Transformer Models with Differentiable Representation Pooling

By Michał Pietruszka, Łukasz Borchmann

The problem of having quadratic complexity w.r.t. the length of the attention mechanism in transformers is approached using pooling operations for reducing the number of word-vectors in between layers. The paper finds that even a hard selection of word-vectors outperforms Linformer and Reformer-based baselines on a long-documents summarization task both in speed and ROUGE scores. However, this hard selection remains suboptimal, as gradients are not propagated to each element in the sequence. This drawback was eliminated by introducing the novel pooling operation, namely The Successive Halving Differentiable Topk. It allows scoring each element in the sequence and selecting a predetermined number of word-vectors that achieved the highest score.


  • Word-vector pooling allows achieving sub-linear complexity. Keeping the lower sequence length after the pooling is beneficial for the complexity of the next layers, FFNs, and even the decoder’s cross attention. More vectors can be eliminated in subsequent layers, further decreasing complexity as in the Pyramidion model.
  • Massive saving on memory and time (16x and 3.5x respectively) are achieved while outperforming dense baselines at the same time. The time overhead for scoring and pooling is minimal, but the elimination of some information redundancy improves the training significantly.
  • The best models were reusing a part of the saved computations for deepening the network.

Follow-up questions

  • The proposed Successive Halving Top-k operator is universally applicable. How do you want to use that in other fields? What are specific examples of tasks and model architectures?
  • How can other methods (e.g., Linformer) benefit from keeping the lower number of word-vectors?

I hope you will find it interesting! :hugs:


Hi @lewtun , thanks for the question! :slight_smile:
Indeed all the linear layers (torch.nn.Linear) are replaced with custom modules that add scores matrices to accumulate the momentum for pruning.
As of now, we have no plan to include it more broadly in the transformers library even though it is fairly straight-forward to do it: replace all the torch.nn.Linear and change the forward call. I believe @madlag has some code to automatically do that on the fly, maybe he would be open to share about that?

1 Like

Thanks for the answer @VictorSanh! I was able to adapt your implementation to work with a custom Trainer and the latest version of transformers using the approach you suggested. Nevertheless, I would certainly be interested in seeing how the mapping of BERT → MaskedBERT can be done on the fly :slight_smile:

If I may ask a follow-up question: what is the heuristic for picking the number of warmup steps? Is it the first 6% of steps that one sees in the literature (e.g. the RoBERTa paper)?

The reason I ask is that I want to run some experiments on a subset of SQuAD and am wondering how I should scale-down the warmup_steps argument accordingly

Hi @lewtun !

I have almost finished my work on an extension of @VictorSanh work, I tried to make it as generic as possible, to be able to patch any network with only minimal additional work, and to include it in your own training infrastructure.
As @VictorSanh mentionned, it won’t probably be part of transformers, but a standalone tool. I will be releasing it in the following weeks (hopefully before end of month), so I hope you can wait until that point !
To be able to patch a network “on the fly” you can use the approach I used in pytorch_block_sparse , using inspection of pytorch modules.
(You can see the first results of my work I had a few weeks ago here for example )


Thanks a lot for the pointers @madlag! I didn’t know about your pytorch_block_sparse repo - this is exactly what I’m looking for :slight_smile:

I’ll keep an eye out for the release of your tool - do I understand correctly that this will enable people to incorporate e.g. movement pruning in their workflow or is it more focused on patching networks?


That’s a good question!
In my experience having between 5% and 10% of warmup_steps is a good enough target.

If your question is specifically for movement pruning. the most important thing (especially if you have a smaller dataset) is not to prune too fast (i.e. having a total number of steps sufficiently high). @madlag had some experiments where he just doubled the number of epochs (pruning more slowly) and improved the results I reported in the movement pruning paper.

In the general case, some recent papers also echo this experimental trick ([2006.04884] On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines, [2006.05987] Revisiting Few-sample BERT Fine-tuning). Having enough epochs help stabilize the training especially for very small datasets.


1 Like

Thanks a lot for the tips and pointers to the literaure @VictorSanh - they’re really useful!

Hi @madlag
Is this extension referring to the method of quantizing the layers other than nn.Linear and nn.LSTM, which are not supported by the quantization api of PyTorch by default?

I have been experiencing issues while quantizing a GPT2 based model, where most of the layers contain nn.Conv1D, using the Movement Pruning notebook in examples.


Hi @mriganktiwari, Francois is referring to his brand-new nn_pruning library that extends the work done by Victor et al on movement pruning and provides an inference speed-up without quantization.

If you’re having trouble quantizing GPT-2, one idea could be to convert it to ONNX format and then optimize the graph with ONNX runtime as done here: onnxruntime/Inference_GPT2_with_OnnxRuntime_on_CPU.ipynb at master · microsoft/onnxruntime · GitHub

I’ve generally found the ONNX runtime supports more operators for quantization and the notebook I linked to shows you how to do it :slight_smile:

1 Like

Thanks @lewtun for the quick response,

I have already quantized my model via ONNX, now I was trying to use pruning as a way of further reducing the size and inference time for my DistilGPT2 model - and thought the Movement pruning notebook from Hugging Face might be helpful.
But I get it, that as of now the PyTorch quantization API does not support quantizing of t he Conv1d module.

So I’ll look for other ways to prune using the Tensorflow version of the model.

Thanks again!

1 Like

Just an idea: what if you prune the model first with nn_pruning, convert to ONNX and then quantize / optimize with ORT?

I’m not sure whether the optimized models produced by nn_pruning (i.e. after the heads/rows/columns with zeroes are removed) can be exported to ONNX format, but this might be a way to get the best of both worlds

1 Like