Is it possible to run a paid inference API at Hugging Face for large LLMs? For example, I want to evaluate a few custom datasets on BLOOM/OPT models. I understand that it's not feasible for Hugging Face to provide thousands of free API calls. But can I just pay for it and get inference output on publicly available models at Hugging Face?
cc @jeffboudier
Thanks for asking! To evaluate models beyond the free tier of the Inference API, you can use our paid inference solution, Inference Endpoints, and deploy any model on dedicated infrastructure for your use (billed by capacity, not by requests). Note that for >10Bn models the available hardware instances may not fit the model (e.g. BLOOM 176Bn, OPT 175Bn), which may require you to request quota for large instances, or a custom quote / deployment.
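For reference, a public model on the Inference API can be queried over plain HTTP with an access token. A minimal sketch, assuming a text-generation model and a token stored in the `HF_TOKEN` environment variable (model ID and generation parameters are just examples):

```python
import os
import requests

# Hosted Inference API endpoint for a public model (bigscience/bloom used as an example).
API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}  # your HF access token

payload = {
    "inputs": "Translate to French: The weather is nice today.",
    "parameters": {"max_new_tokens": 50},  # generation options vary by task/model
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json())  # typically a list like [{"generated_text": "..."}]
```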
Thanks a lot for the fast reply.
According to this page, it seems that right now it's not possible to deploy the 176B model. Or is there any other way we can do that?
I can already see that BLOOM is deployed on Azure: bigscience/bloom · Hugging Face.
Is it on CPU?
Right now, as a researcher, I only need inference results for a few datasets, which may not require more than a couple of hours each. Is it OK to request a custom quote/deployment for that?
Note that I want to host "google/flan-t5-xxl" (11B), "bigscience/bloomz" (176B), or OPT (175B) (depending on performance). I'm not sure which hardware to use, though, since A100 nodes are not available, and judging by the availability of p4de.24xlarge on AWS, I'm not sure when we will get one.
@jeffboudier Any suggestions on hardware selection for the "google/flan-t5-xxl" (11B), "bigscience/bloomz" (176B), and OPT (175B) models?
Hi - for the 11B model, you can check out this tutorial on running T5-11B on a T4 via Inference Endpoints: Deploy T5 11B for inference for less than $500
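Once an Endpoint is running, it exposes its own URL that accepts the same payload format as the Inference API. A rough sketch (the endpoint URL below is a placeholder you would copy from your Endpoint's overview page):

```python
import os
import requests

# Placeholder URL -- replace with the real one shown on your Inference Endpoint page.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

payload = {"inputs": "Summarize: Hugging Face hosts thousands of open models."}
response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
print(response.json())
```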
BLOOMZ and OPT would require an 8xA100 80GB instance (e.g. p4de.24xlarge on AWS), which wouldn't make sense to set up / grant for a couple of hours; maybe your usage falls within the free limit of the Inference API? Azure is sponsoring free inference for BLOOM (not BLOOMZ) in the Inference API - hosted on an 8xA100 80GB.
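If the free Inference API limits do cover your use case, a small evaluation run can simply loop over your dataset and collect generations. A rough sketch, assuming your custom dataset is a list of prompt strings and you are targeting the hosted BLOOM model (the API may return 503 while a model is loading, so a retry is included):

```python
import os
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

# Hypothetical custom dataset: one prompt per example.
prompts = [
    "Question: What is the capital of France?\nAnswer:",
    "Question: Who wrote Hamlet?\nAnswer:",
]

results = []
for prompt in prompts:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 32}}
    resp = requests.post(API_URL, headers=headers, json=payload)
    if resp.status_code == 503:  # model still loading -- wait and retry once
        time.sleep(resp.json().get("estimated_time", 30))
        resp = requests.post(API_URL, headers=headers, json=payload)
    resp.raise_for_status()
    results.append(resp.json()[0]["generated_text"])
    time.sleep(1)  # be gentle with the shared free tier

for r in results:
    print(r)
```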
Thanks a lot for the suggestion. @jeffboudier