I want to do zero-shot text classification with either this model [1] (711 MB) or something similar.
I want to achieve high throughput, measured in classification requests per second.
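For reference, my current baseline is the plain transformers pipeline on CPU. A minimal sketch (the text and candidate labels are placeholders for my real ones):

```python
# Baseline: zero-shot classification with the plain transformers pipeline on CPU.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/roberta-large-zeroshot-v2.0-c",
    device=-1,  # force CPU
)

# Placeholder input and labels; swap in the real ones.
result = classifier(
    "The new update makes the app crash on startup.",
    candidate_labels=["bug report", "feature request", "praise"],
)
print(result["labels"][0], result["scores"][0])
```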
The classification will run on low-end hardware: a Hetzner [2] machine without a GPU (Hetzner is great, reliable, and cheap; they just do not offer GPU machines), something like:
CCX13: dedicated vCPU, 2 vCPU, 8 GB RAM
CX32: shared vCPU, 4 vCPU, 8 GB RAM
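To make "high throughput" concrete, this is roughly how I measure requests per second on such a box (a rough single-threaded loop, placeholder inputs):

```python
# Rough throughput measurement: sequential requests against the pipeline.
import time

from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/roberta-large-zeroshot-v2.0-c",
)

texts = ["The new update makes the app crash on startup."] * 50
labels = ["bug report", "feature request", "praise"]

start = time.perf_counter()
for text in texts:
    classifier(text, candidate_labels=labels)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.2f} requests/second")
```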
Now there are multiple options for deploying and serving LLMs:
lmdeploy
text-generation-inference
TensorRT-LLM
vllm
New frameworks for this keep appearing, and I am a bit lost.
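One CPU-oriented route that is not in the list above is exporting the model to ONNX and running it with ONNX Runtime via Optimum, keeping the pipeline API on top. A sketch of my understanding of that setup, assuming `optimum[onnxruntime]` is installed (I have not verified the actual speedup on this model myself):

```python
# Export the NLI checkpoint to ONNX and run it with ONNX Runtime on CPU.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "MoritzLaurer/roberta-large-zeroshot-v2.0-c"

# export=True converts the PyTorch checkpoint to ONNX on the fly.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
print(classifier(
    "Great documentation, but the latency is terrible.",
    candidate_labels=["positive", "negative"],
))
```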
Would you suggest the best option for deploying the model listed above on no-GPU hardware?
[1] MoritzLaurer/roberta-large-zeroshot-v2.0-c · Hugging Face: https://huggingface.co/MoritzLaurer/roberta-large-zeroshot-v2.0-c
[2] https://www.hetzner.com/cloud/
P.S. I know about this post:
The recommendations there are great, but it has been four years.