Hugging Face Reads
January 2021 - Sparsity and Pruning
Introduction to the Hugging Face Reads (HFR) series
New year, new Hugging Face reading group ! We are launching the Hugging Face Reads (HFR) series: each month, we will choose a topic to focus on, reading a set of four papers recently published on the subject. We will then write a short blog post summarizing their findings and the common trends between them, questions we had for follow-up work after reading them, and how recent advances in the area have affected our work at HF.
The first topic for January 2021 is Sparsity and Pruning, and we are planning to address Long-Range Attention in Transformers next month. Enjoy, and come join the conversation here!
While large-scale pre-trained language models help solve an ever-growing set of natural language processing tasks, the progressive increase in their sizes raises concerns about their wide-scale applicability in practical settings, especially on devices with limited memory and computing power.
Sparse neural network models which only use a fraction of the large parameter sets of their dense equivalents offer a promising avenue to reduce these computational demands. Recent works have proposed various methods to achieve impressive levels of sparsity, whether by gradually choosing which parameters to retain during training or by “pruning” the parameter set after the fact. This post presents an overview of four papers proposing or analyzing such methods. We review the following works: the (Chen et al., NeurIPS 2020) paper investigating the applicability of the Lottery Ticket Hypothesis to BERT-style models, the (Frankle et al., 2020) analysis of currently available methods to find sparsity patterns at initialization before doing any training, the (Li et al., 2020) work on the computational and performance trade-offs between training a large model to prune later vs. training smaller models right away, and the (Hooker et al., 2020) study of the biases introduced by current methods used to compress models (including pruning).
For each paper, we identify some of the claims and contributions, as well as some follow-up questions.
By Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin
The Lottery Ticket Hypothesis (LTH) was initially developed and tested on computer vision systems. It states that given an initialization of a model, it is possible to find a subset of sufficient parameters during training: i.e., such that training only those parameters while setting the others to zero allows the model to reach the same performance as training the full model. Unfortunately, this subset can only be found after some amount of computation, and the method requires several iterations of re-training (either from scratch or from an earlier checkpoint, a method known as rewinding) and pruning for full effect. However, the approach can still end up improving training time and outputs a ready-to-use sparse model.
This paper sets out to validate the LTH in NLP (and in particular in BERT-style models). Specifically, it asks whether sparse subnetworks of a model pre-trained with Masked Language Modeling (MLM) are sufficient to solve down-stream tasks. The answer is broadly positive.
- Using a pre-trained initialization, BERT contains sparse subnetworks at non-trivial sparsities that can be fine-tuned in isolation to full performance on a range of downstream tasks.
- As opposed to previous work, these subnetworks are found at pre-trained initialization and not at random initialization (which was the case with the original LTH work). Rewinding does not significantly improve accuracy on downstream tasks.
- There are universal subnetworks that transfer to all studied downstream tasks. By further fine-tuning on the same task that was used for pre-training (Masked Language Modeling), the method finds a 70% sparse sub-network that can yield good results on all downstream applications.
- In practice, the computational cost of fine-tuning is already much less than that of pre-training. How would “fine-pruning” (pruning while fine-tuning with methods such as movement pruning) a model on a downstream task compare to using the LTH-sparse model obtained with MLM (or with the downstream task)?
- The lack of impact of rewinding is in stark contrast with previous work on networks initialized from scratch and bears closer examination. For example, does this finding hold across fine-tuning learning rates? How much does the value of the selected parameters change over time?
By Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
This paper analyzes the performance of several methods to prune networks at initialization, so even before training starts, to save on training time as the network is smaller or sparse (SNIP, GraSP, SynFlow, Magnitude pruning). The methods are allowed to sample the dataset to perform the pruning: this sampling can be considered negligible compared to the computation required for training. They compare the methods to two “upper bounds” representing the performance we can hope to achieve when given access to information that is available after training: Magnitude Pruning and Lottery Ticket Rewinding.
- All proposed methods are better than random pruning, but they are not sensitive to the individual selection weights, only to pruning proportions on each layer. Even worse, selecting the weights with the lowest instead of the highest value of the utility criteria improves performance on some methods (GraSP), which appears to invalidate some of the original works’ claims.
- The methods are far from competitive with post-training approaches. Moreover, none of the methods is SOTA in all settings: some methods are better at some sparsity levels than others, but this depends on sparsity.
- The methods yield better results if they are applied after a few training steps rather than right away at initialization, but they need a significant amount of training to approach the proposed “upper bounds”.
- The problem of finding a “good” subnetwork right at initialization seems somewhat under-defined and possibly overly difficult: which task or set of tasks is used to measure success? Is it even possible to find an ideal sub-networks that works on any possible task a priori? Consequently, it is hard to tell whether the mixed results stem from flaws in the methods or from the task’s inherent difficulty. More insights here would be particularly enlightening.
- The authors note that the studied methods prune “layers, not weights”, which may explain the surprising results they obtain by inverting the weight selection. In that case, would a dense model with adaptive layer sizes following the same patterns work as well?
- An interesting follow-up direction could be something along the lines of “pruning as soon as possible”. Recent “Bertology” work has shown that pre-trained models learn different levels of skill in sequence: we are particularly eager to see follow up work that explores the relationship between the emergence of these skills and the optimal pruning strategy.
Train Large then Compress, Rethinking Model Size for Efficient Training and Inference of Transformers
By Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez
This paper explores the landscape of the computational tradeoffs between transformer models’ sizes and the required numbers of hyper-parameter settings and training steps to achieve a good performance. It finds larger sizes can allow for fewer hyper-parameter settings and training steps and offers some practical advice about choosing a larger initial number of parameters that can later be pruned to, counter-intuitively, reduce the overall computational cost of training a mode when compared to just training a smaller model from scratch.
- Large models are faster to train: they reach a given precision faster when measuring optimizing steps/wall clock time/ flops, even when they are stopped before convergence. Absolute size is more important than depth or width alone, but depth can be more important than width in some cases. The faster convergence usually makes up for the faster execution of smaller models.
- Large models can be compressed to smaller networks. Training large networks might speed up training but would lead to problems at inference time, as their resource cost is much higher. This work finds that pruning them to networks that end up containing fewer parameters than the original smaller alternatives still yields higher performance. They can be quantized too with less quantization error.
- Batch size has an influence on training speed. In practice, this means that gradient accumulation should be used for larger models.
- The results are impressive, but it can still be difficult to get some intuition for why the larger models converge to a better state faster and are easier to prune. The authors mention previous work hinting that deeper networks “promote movement along directions already taken” as a possible explanation, but we are definitely looking forward to reading further analysis.
- The connection to Lottery Ticket Hypothesis is mentioned only in passing. Further work exploring whether the sub-networks selected by the two approaches are similar in any fashion (such as by considering the Jaccard distance between the sets).
By Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, Emily Denton
This paper sheds light on the impact of pruning on neural models for vision and shows that reported top-line accuracies often hide the disproportionate negative impact on certain classes of inputs. The paper connects this phenomenon to bias and fairness considerations.
- While the overall error is largely unchanged when a model is compressed (by pruning and quantization), there is a set of data that bears a disproportionately high portion of the error, with their accuracy falling by up to 50% while the overall performance only decreases by 1%, regardless of what the original accuracy was on the group.
- These examples (or at least some of them) can be consistently identified by comparing the predictions from a population of compressed models with the predictions from a population of non-compressed models on the same inputs: the examples where the predictions distributions diverge are called Compressed Identified Examples (CIE).
CIE often correspond to low-frequency patterns in the inputs. Compression cannibalizes performance on low-frequency patterns in order to optimize the performance on higher-frequency patterns and preserve the overall accuracy.
- Compression thus amplifies biases of models (amplifying certain errors on certain types of inputs). The authors suggest using CIE as an auditing tool for compressed models: surfacing a tractable subset of the data for further inspection by domain experts to assess this issue.
- This paper studies are pruning and quantization techniques that are run after training. One question that remains open is whether the models are facing an issue of modeling capacity (i.e., less-biased predictions require more representation power) or whether it is tied to the training procedure. Analyzing methods that reduce model size in the course of training or approaches such as gradual pruning would shed some light on this question.
- Would up-weighting the CIE examples in training lead to models that are more robust to compression? Or would we expect to find different CIE groups?
- The authors suggest using CIE as a diagnostic tool. What can be done with the diagnostic? Are there other calls to action from these insights? For instance, how could we change existing benchmarks on compression to include robustness metrics (i.e., adding another component to the tradeoff size vs. accuracy on CIE groups)?
Reading Group Discussion
The quantitative results obtained on many of the common benchmark tasks by pruning are impressive. At the same time, they also remind us how much we still have to learn about the training dynamics of neural networks. Common wisdom states that “overparameterization helps with optimization”, but we have little theory available to help us understand the phenomenon further, especially in the deep attention-based models that perform so well in NLP.
Each of the four papers above offers a different view of this question of modeling capacity vs. optimization vs. generalization.
The Lottery Ticket Hypothesis relies on the quality of the initial state of the parameters at least as much as on the evolution of the weight values during optimization. As such, the main purpose of increasing the number of parameters would be to exponentially increase the chances of hitting a good sub-network at initialization.
Other approaches focus more on how and whether the gradient flowing through the possibly redundant parameters help optimize the value of the ones we want to keep in the final pruned network: whether they try to evaluate that impact a priori as in the SynFlow algorithm or are content to simply keep them around for optimization based on their empirically proven efficiency and to prune them at the end of the training.
All of the works outlined above, however, assume that the neural networks are indeed over-parameterized and that they can be pruned without changing their qualitative behavior. The CIE work questions that assumption and finds that pruning does change the behavior of the model in non-trivial ways. This assessment also agrees with some experiments Victor Sanh has run on the task for natural language inference, gradually pruning a model trained on multiNLI and testing it on the HANS dataset. As the sparsity increases, the generalization as measured by the accuracy on the HANS test set decreases and gradually drops to 0 while the performance on the multiNLI test set stays mostly constant. Another experiment along those lines would be to see how much factual knowledge pre-trained language models lose as they are pruned (for example by monitoring closed-book QA accuracy for a model like T5).
The question remains whether this loss of generalization and increased bias is a result of the model losing “expressive capacity” as its number of parameters decreases or whether the fault lies in the compression strategies that aren’t quite flexible enough, but the results certainly suggest that a large number of parameters serves as more than a crutch for optimization.
Another question that is somewhat orthogonal to the one above is that of when to optimally prune weights from the model. Pruning early saves computation, but does not benefit from any signal from the target task. Pruning after training can take advantage of additional information but does not save any computation at training time or allow the parameters to adapt to the new sparsity pattern. Gradually pruning during training seems to provide the best of both worlds, but introduces a new set of hyper-parameters which may make optimization more costly. One should also keep in mind that actual computational gains will depend on the capabilities of current hardware and their ability to take full advantage of shifting sparsity patterns.
We’re very much looking forward to the progress on all of these questions that 2021 is sure to bring!
@HuggingFace : Sparsity and Pruning
We first started investigating ways to make existing models more computationally efficient with DistilBERT, a method which was used to train one of our most popular models. The follow-up on sequence-to-sequence models yielded DistilBart, which also reaches similar performances to their larger counterparts at a lesser cost. Recently, we have also investigated approaches which focus on sparsity more specifically.
Most of the works referenced above use magnitude pruning, a widely used strategy for pruning which thresholds weight values and simply sets the smallest ones to zero. In our work on Movement Pruning led by Victor Sanh, we argue that this approach is less effective in the context of transfer learning and highlight the importance of considering the changes of weights during fine-tuning as opposed to relying (mostly) on the pre-trained values. Code & hyper-parameters are available here.
Block Movement Pruning
The main drawback of unstructured pruning from a practical point of view is that current hardware can make it quite difficult to take full advantage of the sparsity pattern to accelerate the computation of the network. A compromise that can help alleviate this issue is the use of “semi-structured” sparsity patterns. By selecting blocks (typically 32x32) of weights instead of single weights, and running the same kind of optimization methods.
Accelerating block sparse linear algebra is easier, and the pytorch_block_sparse library developed at Hugging Face is our attempt to show that. We are pretty confident more and more solutions for block-sparsity computation will emerge, and we will be working with major actors to enable it. We are already providing some sample networks that take advantage of block sparsity, so you can judge by yourself!
Finally, we also work to combine block sparsity with other accelerated sparsity patterns such as NVidia Ampere, to further decrease the memory, the compute and the energy used by the neural networks that will be everywhere in the near future.