Scaling up BERT-like model Inference on modern CPU - Part 1

Hi community,

I have come across the nice article by @mfuntowicz: Scaling up BERT-like model Inference on modern CPU - Part 1.

It sounds really interesting how easily you can benchmark your BERT transformer model with the CLI and Facebook AI Research’s Hydra configuration library.

Is it possible, however, to easily test it on cloud services such as AWS, and how would you deploy it?

Thanks!


Hey @Matthieu,

Thanks for reading and posting here :slight_smile:.

Indeed, everything in the blog post was run on AWS c5.metal instance(s).
The way I’m currently using it:

  • git clone https://github.com/huggingface/tune
  • cd tune
  • pip install -r requirements.txt
  • export PYTHONPATH=src
  • python src/main.py --multirun backend=pytorch batch=1 sequence_length=128,256,512
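
Hydra’s --multirun mode sweeps the cross product of the comma-separated values, so several shapes can be covered in a single command. A minimal sketch of a broader sweep, assuming the batch parameter accepts a comma-separated list the same way sequence_length does:

  • python src/main.py --multirun backend=pytorch batch=1,4,8 sequence_length=128,256,512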

The overall framework is quite new and I’ll be improving the UX in the coming days; sorry for the rough user experience :innocent:

Morgan :hugs:


Hi @mfuntowicz

Thanks again for the great article.

Regarding the setup you described above (cloning huggingface/tune and running it with Hydra’s multirun CLI), I have two questions:

  1. Did you run these command lines directly on the c5.metal instance? You didn’t need to install a Docker image with an OS and pip/Python packages on it first?
  2. With this framework you can measure how different hardware parameters affect the latency/throughput results. However, transformers are generally encapsulated within a Docker image exposing an API before deployment on cloud services. How could this benchmark simulate the real latency/throughput of the deployed Docker image?

Matthieu

  1. Did you run these command lines directly on the c5.metal instance? You didn’t need to install a Docker image with an OS and pip/Python packages on it first?

Yes, exactly.

  2. With this framework you can measure how different hardware parameters affect the latency/throughput results. However, transformers are generally encapsulated within a Docker image exposing an API before deployment on cloud services. How could this benchmark simulate the real latency/throughput of the deployed Docker image?

That’s an interesting point. We do not provide a testbed for integrated solutions (yet?). Still, all the knobs discussed in this first part, and the ones coming in the second part, can be leveraged within a container and should bring the same performance benefits highlighted in the blog posts.
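
As a minimal sketch of what that could look like (the image name is hypothetical, and OMP_NUM_THREADS is simply the standard OpenMP thread-count knob used by PyTorch on CPU), the same tuning can be passed to a container at launch time:

  • docker run --cpuset-cpus=0-7 --env OMP_NUM_THREADS=8 my-bert-api-image  # pin the container to 8 cores and size the OpenMP thread pool to match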

Of course, it doesn’t simulate the latency overhead of a web server handling incoming requests and/or dynamic batching, as NVIDIA Triton would do for instance.

Hope it helps,
Morgan