- So you ran these command lines directly on the c5.metal instance? You didn’t need to first install a Docker image with an OS and pip/Python packages on it?
- With this overall framework you can simulate the effect of different hardware parameters on latency/throughput. But transformers are generally encapsulated within a Docker image providing an API before deployment on cloud services. How could this benchmark simulate the real latency/throughput of the deployed Docker image?
That’s an interesting point. We do not provide a testbed for an integrated solution (yet?). Still, all the knobs discussed in this first part, and the ones coming in the second part, can be leveraged within a container and should yield the same performance benefits highlighted in the blog posts.
Of course, it doesn’t simulate the latency overhead of a web server handling incoming requests and/or dynamic batching, as NVIDIA Triton would, for instance.
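If you do want to capture that end-to-end overhead yourself, a minimal sketch is to time requests against the container’s endpoint from the outside. The harness below is a hypothetical example (the `send_request` callable is an assumption, standing in for e.g. an HTTP POST to your deployed API) and measures wall-clock latency percentiles as the client sees them:

```python
import time


def measure_latency(send_request, n_requests=100, warmup=10):
    """Measure client-side end-to-end latency of a request callable.

    send_request: zero-argument callable that performs one inference
    request (e.g. an HTTP POST to the container's endpoint).
    Returns (p50_ms, p99_ms) over n_requests, after warmup calls
    that are excluded from the measurement.
    """
    for _ in range(warmup):
        send_request()

    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        send_request()
        latencies.append((time.perf_counter() - start) * 1e3)  # milliseconds

    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    return p50, p99
```

Pointing `send_request` at the containerized API (rather than calling the model in-process) makes the serialization, networking, and server-side queuing overhead part of the measurement, which is exactly what the in-process benchmark above cannot see.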
Hope it helps,