Advice on speed and performance


I have the feeling that I might be missing something about performance, speed, and memory usage with Hugging Face transformers. I like this repo and Hugging Face transformers very much (!), so I hope I am not overlooking something, as I have used almost no other BERT implementations. I use Hugging Face because I want to use TF2.

Now I would like to speed up inference and, if possible, reduce memory usage.

As a native TensorFlow user, I have no experience with the PyTorch models at all.

  • Is it possible that the PyTorch models are more performant and efficient than the TF models?
  • How can I speed up inference? Encoding 200 sentence pairs takes 12 seconds on my CPU.
  • Is it more feasible to use the PyTorch models for inference, or even for training? Are there any differences in memory usage?
  • Why is bert-as-a-service more performant and faster (as it appears to be)? I hope I can test this.

I ask because I stumbled upon this:

Any advice on better usage (for deployment) would be much appreciated.

Is Hugging Face with PyTorch faster than with TensorFlow?

@jplu is currently working on making the TF2 models a lot faster!
The situation should improve soon (though probably not for a few weeks yet).

Hello!

As thomwolf said, we are currently working on a much more performant version of the TF models. For now, yes, the PyTorch models are more optimized than the TF ones.
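To check the difference on your own hardware, a small backend-agnostic timing harness along these lines could help. This is only a sketch: `benchmark` and `encode_batch` are hypothetical names, and the lambda in the example is a stand-in that you would replace with a call to the TF or PyTorch model you want to compare.

```python
import time


def benchmark(encode_batch, inputs, batch_size=32, warmup=1, repeats=3):
    """Return the best wall-clock time in seconds to encode all inputs."""
    batches = [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]
    # Warm-up passes absorb one-off costs such as graph tracing or lazy initialisation.
    for _ in range(warmup):
        for batch in batches:
            encode_batch(batch)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for batch in batches:
            encode_batch(batch)
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)


if __name__ == "__main__":
    # Stand-in encoder (assumption): swap in the real model call here.
    pairs = [("first sentence", "second sentence")] * 200
    seconds = benchmark(lambda batch: [len(a) + len(b) for a, b in batch], pairs)
    print(f"{seconds:.4f} s for {len(pairs)} pairs")
```

Running the same harness with identical inputs and batch size for each backend keeps the comparison fair; the warm-up pass matters especially for TF2, where the first call may include tracing.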

Bert-as-a-service is faster because it is highly optimized for inference. It:

  • runs BERT with mixed precision (something we can do as well)
  • freezes the model
  • provides a powerful, scalable service API

If you are looking for performant inference for your TF model, I suggest taking a look at ONNX; we provide a script in the repo to create your own optimized ONNX model. Afterwards, you can run a Triton server to serve your model.
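As a rough illustration of the ONNX route, here is a minimal sketch of running an already exported model with ONNX Runtime on CPU. The model path and the input names are assumptions that depend on how the model was exported, and `build_feed` and `run_onnx` are hypothetical helpers, not part of any library.

```python
def build_feed(encoded, input_names):
    """Keep only the tokenizer outputs that the ONNX graph declares as inputs."""
    missing = [name for name in input_names if name not in encoded]
    if missing:
        raise KeyError(f"tokenizer output lacks inputs: {missing}")
    return {name: encoded[name] for name in input_names}


def run_onnx(model_path, encoded):
    """Run one batch through an exported ONNX model and return all graph outputs."""
    # Imported lazily so build_feed stays usable without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_names = [node.name for node in session.get_inputs()]
    # Passing None as the first argument requests every output of the graph.
    return session.run(None, build_feed(encoded, input_names))


# Example usage (assumes an exported file and NumPy tokenizer outputs):
#   encoded = tokenizer(first_sentences, second_sentences,
#                       padding=True, return_tensors="np")
#   outputs = run_onnx("model.onnx", encoded)
```

Filtering the feed by the session's declared inputs avoids errors when the tokenizer returns fields (for example `token_type_ids`) that the exported graph does not accept.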


Hey, thank you. So I could also take a TF model fine-tuned with Hugging Face and use it with bert-as-a-service?