What is the difference between the model API and the pipeline API?

Hello,
I tested two API endpoints for generating BGE-M3 embeddings:

  1. "https://api-inference.huggingface.co/models/BAAI/bge-m3"
  2. "https://api-inference.huggingface.co/pipeline/feature-extraction/BAAI/bge-m3"

When using the first API, there is a noticeable load time before receiving the response.
However, when using the second API, the results are returned immediately.

What is the difference between these two APIs?
Also, does the second API always guarantee immediate responses?

I couldn’t find any documentation comparing the two, so I’m asking here for clarification.
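
For reference, a minimal sketch of how both endpoints can be called and timed with `requests` (the `HF_TOKEN` environment variable and the input sentence are assumptions for illustration):

```python
import os
import time

import requests

HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
PAYLOAD = {"inputs": "What is the capital of France?"}

# The two endpoints from the question above.
URLS = [
    "https://api-inference.huggingface.co/models/BAAI/bge-m3",
    "https://api-inference.huggingface.co/pipeline/feature-extraction/BAAI/bge-m3",
]

for url in URLS:
    start = time.perf_counter()
    response = requests.post(url, headers=HEADERS, json=PAYLOAD)
    elapsed = time.perf_counter() - start
    print(f"{url}\n  status={response.status_code}  time={elapsed:.2f}s")
```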


The first endpoint is the serverless Inference API. It loads the model on demand, so a request after a period of inactivity can be slow, and it may fail (typically with a 503 "model is currently loading" error) if the model is not yet loaded. The pipeline endpoint is often optimized for speed and may keep the model loaded in memory, which is why it can respond immediately.

That said, the pipeline endpoint does not guarantee immediate responses in all scenarios. Server load, network latency, and the size or complexity of the input can still affect response times. Expect it to be faster on average, not always instantaneous.
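
As one way to work around the cold-start behavior on the serverless endpoint, the Inference API accepts a `wait_for_model` request option that blocks until the model is loaded instead of failing. A minimal sketch (same assumed `HF_TOKEN` variable and input as above):

```python
import os

import requests

API_URL = "https://api-inference.huggingface.co/models/BAAI/bge-m3"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

payload = {
    "inputs": "What is the capital of France?",
    # Ask the API to wait until the model is loaded instead of
    # returning a 503 "model is loading" error right away.
    "options": {"wait_for_model": True},
}

response = requests.post(API_URL, headers=HEADERS, json=payload)
response.raise_for_status()
embedding = response.json()
print(len(embedding))  # shape of the result depends on the task/model
```

With `wait_for_model`, the first call still pays the load time, but subsequent calls should be fast while the model stays warm.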
