Hi all,
I’m developing a model for production, but the machine that will run inference doesn’t have a GPU (as is usual in these cases). So I’m testing different PyTorch-native optimization strategies and building a benchmark graph to decide which ones to use. In case anyone finds it interesting, here is the benchmark graph, with a description of each optimization below it. The model is a medium BERT with an LSTM layer on top. The x-axis represents the number of 512-token splits in each document (for example, a document with 1024 tokens has 2 splits), so you can read it as a measure of document length.

Normal: No optimization applied.

Quantized: Dynamic quantization applied. You can read more about this technique here. I don’t know why it isn’t working well. I applied the process to the complete model; maybe I should be more selective about which layers to quantize…
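
For reference, this is roughly what a more selective dynamic quantization could look like. It is only a sketch: `TinyModel` is a hypothetical stand-in for the BERT + LSTM model (not the real one), and choosing to quantize only the `nn.Linear` layers is an assumption for illustration, not a recommendation I have benchmarked.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the real BERT + LSTM model."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 32)
        self.lstm = nn.LSTM(32, 16, batch_first=True)
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        x = torch.relu(self.encoder(x))
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # use the last time step


model = TinyModel().eval()

# Quantize only the nn.Linear layers; the LSTM is left in float32.
# Passing {nn.Linear, nn.LSTM} instead would quantize both layer types.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    output = quantized(torch.randn(4, 10, 32))  # (batch, seq_len, features)
```

Benchmarking each choice of layer set separately should show which layers actually benefit from quantization.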

N_threads: Decrease the number of threads used internally by PyTorch (128 by default on my machine). I’ve found that 64 is a good number.
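
A minimal sketch of how the thread count can be set and compared, using `torch.set_num_threads`. The `workload` function and the candidate values are assumptions for illustration; in practice the workload would be a forward pass of the real model on representative documents.

```python
import time

import torch


def time_inference(fn, n_threads, repeats=5):
    """Average wall-clock time of fn() with the given intra-op thread count."""
    torch.set_num_threads(n_threads)
    fn()  # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


default_threads = torch.get_num_threads()

x = torch.randn(256, 256)


def workload():
    # Hypothetical stand-in for model(inputs)
    return x @ x


# Compare one thread, half the default, and the default.
candidates = sorted({1, max(1, default_threads // 2), default_threads})
timings = {n: time_inference(workload, n) for n in candidates}

best = min(timings, key=timings.get)
torch.set_num_threads(best)  # keep the fastest setting for later inference
```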

Prune: A technique that applies a binary mask to the model weights to set some of them to zero (the strategy for choosing which weights to zero out is selected by the user). You can read more about this technique here.
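
A small sketch of the masking idea with `torch.nn.utils.prune`. The single `nn.Linear` layer, the L1 (smallest magnitude) strategy, and the 30% amount are assumptions for illustration, not the settings used in the benchmark.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(64, 64)  # hypothetical layer standing in for one model layer

# Zero out the 30% of weights with the smallest absolute value.
# This adds a `weight_mask` buffer and a `weight_orig` parameter;
# `layer.weight` becomes weight_orig * weight_mask.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fraction of weights that are now exactly zero
sparsity = float((layer.weight == 0).sum()) / layer.weight.nelement()

# Make the pruning permanent: drops the mask and keeps the zeroed weights.
prune.remove(layer, "weight")
```

Note that the masked weights are still stored and multiplied as a dense tensor, so masking by itself does not necessarily make inference faster.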
Any suggestions are welcome!