Scaling up BERT-like model Inference on modern CPU - Part 1

Hi community,

I have come across the nice article by @mfuntowicz: Scaling up BERT-like model Inference on modern CPU - Part 1.

It sounds really interesting how easily you can benchmark your BERT transformer model with the CLI and Facebook AI Research’s Hydra configuration library.

Is it possible, however, to easily test it on cloud services such as AWS, and how would you deploy it?

Thanks!


Hey @Matthieu,

Thanks for reading and posting here :slight_smile:.

Indeed, everything in the blog post was run on AWS c5.metal instance(s).
The way I’m currently using it:

  • git clone https://github.com/huggingface/tune
  • cd tune
  • pip install -r requirements.txt
  • export PYTHONPATH=src
  • python src/main.py --multirun backend=pytorch batch=1 sequence_length=128,256,512
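
Hydra’s --multirun mode sweeps the cross product of the comma-separated values, so several shapes can be covered in a single command. A minimal sketch of a broader sweep, assuming the batch parameter accepts a comma-separated list the same way sequence_length does:

  • python src/main.py --multirun backend=pytorch batch=1,4,8 sequence_length=128,256,512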

The overall framework is quite new and I’ll be improving the UX in the coming days; sorry for the rough user experience :innocent:

Morgan :hugs:


Hi @mfuntowicz

Thanks again for the great article.

Regarding the setup you described above (cloning huggingface/tune and running it with Hydra’s multirun CLI), I have two questions:

  1. Did you run these command lines directly on the c5.metal instance? You didn’t need to install a Docker image with an OS and pip/Python packages on it first?
  2. With this framework you can measure how different hardware parameters affect the latency/throughput results. However, transformers are generally encapsulated within a Docker image exposing an API before deployment on cloud services. How could this benchmark simulate the real latency/throughput of the deployed Docker image?

Matthieu

  1. Did you run these command lines directly on the c5.metal instance? You didn’t need to install a Docker image with an OS and pip/Python packages on it first?

Yes, exactly.

  2. With this framework you can measure how different hardware parameters affect the latency/throughput results. However, transformers are generally encapsulated within a Docker image exposing an API before deployment on cloud services. How could this benchmark simulate the real latency/throughput of the deployed Docker image?

That’s an interesting point. We do not provide a testbed for integrated solutions (yet?). Still, all the knobs discussed in this first part, and the ones coming in the second part, can be leveraged within a container and should bring the same performance benefits highlighted in the blog posts.
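
As a minimal sketch of what that could look like (the image name is hypothetical, and OMP_NUM_THREADS is simply the standard OpenMP thread-count knob used by PyTorch on CPU), the same tuning can be passed to a container at launch time:

  • docker run --cpuset-cpus=0-7 --env OMP_NUM_THREADS=8 my-bert-api-image  # pin the container to 8 cores and size the OpenMP thread pool to match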

Of course, it doesn’t simulate the latency overhead of a web server handling incoming requests and/or dynamic batching, as NVIDIA Triton would do for instance.

Hope it helps,
Morgan