Hello,
I’m working with a RAG system that uses a Hugging Face model. However, the Llama models are too large to load locally. Is there a way to use the model’s API instead of loading it directly, and if so, how?
There are two main ways to do this. The Serverless Inference API is free, but rate limits and shifting model availability make it hard to rely on for anything serious. Inference Endpoints (dedicated) are stable, but they're paid.
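Here's a minimal sketch of calling a model through the Serverless Inference API with `huggingface_hub` instead of loading it locally. The model ID and token are placeholders (assumptions), swap in whichever Llama variant you have access to:

```python
# Minimal sketch: query a hosted model via the Serverless Inference API
# instead of loading weights locally. Requires: pip install huggingface_hub
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder; use a model you have access to
    token="hf_xxx",  # your Hugging Face access token
)

# Chat-style request; the heavy lifting happens on HF's servers
response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarize retrieval-augmented generation in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

If you later move to a dedicated Inference Endpoint, the same `InferenceClient` works, you just point it at your endpoint URL instead of a model ID.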
There are also third-party inference providers that serve HF models through their own APIs, but I'm not very familiar with them.
There's also a Playground where you can try the Inference API in the browser, so you can test a model there before writing any code.