I get the feeling that I might be missing something about performance, speed, and memory when using huggingface transformers. Since I like this repo and huggingface transformers very much (!), I hope I'm not missing anything, as I have hardly used any other BERT implementations. I use huggingface because I want to use TF2.
Now, I would like to speed up inference and maybe decrease memory usage.
As a native TensorFlow user, I have no experience with the PyTorch models at all.
So is it possible that the PyTorch models are more performant and efficient than the TF models?
*How can I speed up inference? Encoding 200 sentence pairs on my CPU takes 12 seconds.
Is it more feasible to use the PyTorch models for inference, or even for training?
Are there any memory usage differences?
*So why does bert-as-a-service appear to be more performant and faster? I hope I can test this.
I ask because I stumbled over this here:
Any advice for better usage (for deployment) would be much appreciated.
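One thing worth checking before comparing frameworks is whether the 200 pairs are fed to the model one call at a time or as padded batches — batching is often the first CPU win. A minimal, plain-Python sketch of the batching pattern (the names and batch size are just illustrative; the actual tokenizer/model call would go inside the loop):

```python
import time

def batched(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

pairs = [("sentence a", "sentence b")] * 200  # stand-in for the real data

start = time.perf_counter()
for batch in batched(pairs, 32):
    pass  # the tokenizer/model call on the whole batch would go here
elapsed = time.perf_counter() - start
print(len(batched(pairs, 32)), "batches")  # → 7 batches
```

With 200 pairs and a batch size of 32, the model is invoked 7 times instead of 200, which usually amortizes per-call overhead considerably.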
As thomwolf said, we are currently working on a much more performant version of the TF models, so for now, yes, the PyTorch models are more optimized than the TF ones.
Bert-as-a-service works faster because it is highly optimized for inference by:
* running BERT with mixed precision (something we can do as well)
* freezing the model
* providing a powerful/scalable service API
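The first two points can be sketched in TF 2. This is only a sketch with a tiny stand-in model (a real setup would use the TF BERT model), and mixed precision mainly pays off on GPUs/TPUs with fast float16 kernels, not necessarily on plain CPUs:

```python
import tempfile
import tensorflow as tf

# Mixed precision: compute in float16 while keeping variables in float32
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Stand-in model; a real setup would load the TF BERT model here instead
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
_ = model(tf.zeros((1, 4)))  # build the model by calling it once

# "Freezing" for inference: export as a SavedModel so the serialized graph
# can be loaded and served without the Python training machinery
export_dir = tempfile.mkdtemp()
tf.saved_model.save(model, export_dir)
restored = tf.saved_model.load(export_dir)
```

The exported SavedModel is what a serving stack would load, rather than the live Keras object.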
If you are looking for performant inference for your TF model, I suggest you take a look at ONNX; we provide a script in the repo to create your own optimized ONNX model. Afterwards you can run a Triton server to provide an API over your model.