CPU Optimization PyTorch Strategies

amlarraz · January 14, 2022, 3:36pm

Hi all,

I’m developing a model for production, but the machine that will run inference doesn’t have a GPU (as is usual in these cases). So I’m testing different PyTorch-native optimization strategies and building a benchmark graph to decide which ones to use. In case someone finds it interesting, this is the benchmark graph, and below it you can find a description of each optimization. The model is a medium BERT with an LSTM layer on top. The x-axis represents the number of 512-token splits in each document (for example, a document with 1024 tokens gives 2), so you can read it as a measure of document length.

Normal: No optimization applied.
Quantized: Dynamic quantization applied. You can read more about this technique here. I don’t know why it isn’t working well. I applied the process to the complete model; maybe I should be more selective about which layers to quantize…
N_threads: Decrease the number of threads used internally by PyTorch (128 by default). I’ve found that 64 is a good number.
Prune: A technique that applies a binary mask to the model weights to set some of them to zero (the strategy for selecting which weights to zero is chosen by the user). You can read more about this technique here.
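
For anyone who wants to try the quantization and pruning steps on something small first, here is a minimal sketch. It is not the benchmarked model: `TinyEncoder` is a hypothetical stand-in (a linear layer plus an LSTM), and the layer set passed to `quantize_dynamic`, the 30% pruning amount, and the tensor sizes are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the real model: a linear layer followed by an LSTM."""

    def __init__(self, dim: int = 32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(self.proj(x))
        return out


torch.manual_seed(0)
x = torch.randn(2, 5, 32)  # (batch, sequence, features), sizes are arbitrary

# "Normal": the unmodified model in eval mode.
baseline = TinyEncoder().eval()

# "Quantized": dynamic quantization restricted to selected layer types,
# instead of whatever the whole model happens to contain.
quantized = torch.quantization.quantize_dynamic(
    TinyEncoder().eval(), {nn.Linear, nn.LSTM}, dtype=torch.qint8
)

# "Prune": mask the 30% of weights with the smallest absolute value in one layer.
pruned = TinyEncoder().eval()
prune.l1_unstructured(pruned.proj, name="weight", amount=0.3)
prune.remove(pruned.proj, "weight")  # bake the mask into the weight tensor
sparsity = (pruned.proj.weight == 0).float().mean().item()

with torch.no_grad():
    for name, model in [("normal", baseline), ("quantized", quantized), ("pruned", pruned)]:
        print(name, tuple(model(x).shape))
print(f"fraction of zeroed weights in pruned.proj: {sparsity:.2f}")
```

One thing worth keeping in mind: an unstructured mask only sets values to zero, and the weight tensor stays dense, so pruning alone may not change inference time unless the kernels can exploit the sparsity.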

Any suggestions are welcome!

amlarraz · February 1, 2022, 12:32pm

Hi all,

I’ve made some progress. First of all, I want to explain the new things I’ve learnt about the number of threads.

There are two different kinds of parallelism in PyTorch, “intra_op” and “inter_op”. The difference between them is what gets parallelized. If I’ve understood it correctly, “intra_op” refers to splitting the work of a single operation (for example, a large matrix multiplication) across several threads, and “inter_op” refers to running independent operations concurrently. You can read about that here. So, I’ve split the choice of the number of threads in two: “intra_op” and “inter_op”. Below you can see the new benchmark graph:
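
As a side note on the thread settings discussed above, this is roughly how the two values can be set and compared. It is only a sketch: the thread counts, the matrix size, and the helper `time_forward` are hypothetical choices for illustration, and a matrix multiplication stands in for the real model's forward pass.

```python
import time
import torch

# The inter-op pool size has to be set before any inter-op parallel work starts,
# so it goes at the very top of the script.
torch.set_num_interop_threads(2)


def time_forward(fn, n_threads, repeats=5):
    """Average wall-clock time of fn() with the given number of intra-op threads."""
    torch.set_num_threads(n_threads)  # intra-op: threads used inside a single operation
    fn()  # warm-up call, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


a = torch.randn(256, 256)
timings = {n: time_forward(lambda: a @ a, n) for n in (1, 2, 4)}

for n, seconds in timings.items():
    print(f"intra_op={n}: {seconds * 1e3:.3f} ms")
print("inter_op:", torch.get_num_interop_threads())
```

`torch.get_num_threads()` can also be called before changing anything to see the default on a given machine.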
