Batch inference with huggingface_hub for serverless providers

Hi everyone,

I have recently started using batch inference for OpenAI and Google Gemini to reduce token costs (see "Batch inference with Gemini" in the Vertex AI documentation on Google Cloud). Is there an equivalent to this on HuggingFace serverless providers via huggingface_hub? Thank you!


Whether batch inference is supported depends on the endpoint. I don’t think many HF endpoints support it…


Short, direct answer first, then details:

  • There is no direct equivalent of OpenAI’s Batch API or Vertex AI’s Gemini batch prediction (with special discounted pricing and job objects) on Hugging Face serverless Inference Providers via huggingface_hub.

  • Hugging Face serverless (HF Inference + other Providers) is priced as a normal online API: you pay for compute time per request, not a separate “batch mode” with cheaper tokens.(Hugging Face)

  • If you want “batch-like” behavior and cost reductions, you have to use:

    • Offline jobs (HF Jobs or your own infra) with open models, plus batching for throughput.(Hugging Face)
    • Cheaper models / providers.
    • Client-side batching + token-reduction strategies.

Below, I’ll spell out:

  1. What OpenAI / Gemini batch actually give you (so we can compare).

  2. How Hugging Face serverless Inference Providers work.

  3. Concrete ways to reduce cost with Hugging Face:

    • Within serverless Providers.
    • Using HF Jobs for offline batch inference.
    • Using your own vLLM / TEI / local endpoints.
    • General token-usage reduction strategies.

1. What you’re doing with OpenAI / Gemini batch

With OpenAI and Gemini:

  • OpenAI Batch API:

    • You upload a big JSONL file of requests.
    • OpenAI processes them asynchronously over up to 24 hours.
    • You get ~50% discount on input and output tokens vs synchronous API.(OpenAI)
    • Output is stored; you later download results.
  • Vertex AI Gemini Batch prediction:

    • Similar idea on Google Cloud: you submit a batch prediction job with a dataset (BigQuery, GCS file, etc.).
    • Processing is asynchronous with relaxed latency constraints.
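    • Pricing is separate from the real-time online endpoints: batch requests are billed at a discount (Google's documentation lists 50% for Gemini models), which makes them cheaper for large offline workloads (your Vertex docs link describes this model).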
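
To make the comparison concrete, here is a minimal sketch of the JSONL input that the OpenAI Batch API expects: one request per line, each with a custom_id, method, url, and body. The helper name build_batch_lines and the default model are illustrative assumptions; uploading the file and creating the batch job are left out.

```python
import json


def build_batch_lines(prompts, model="gpt-4o-mini", max_tokens=256):
    """Return one JSON string per request, in OpenAI Batch input format."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return lines


# Each printed line would become one line of the uploaded .jsonl file.
for line in build_batch_lines(["Summarize: ...", "Classify: ..."]):
    print(line)
```

The custom_id matters because batch results are not guaranteed to come back in input order. Nothing comparable exists on the Hugging Face Provider router, which is the gap the rest of this answer works around.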

So your mental model now is:

“I have a large offline workload. I can pay less per token by submitting it as a batch job instead of hammering the online API.”

That’s exactly what we need to compare against on Hugging Face.


2. How Hugging Face serverless Inference Providers work

Hugging Face’s Inference Providers layer is a unified router over many serverless providers (HF Inference, Together, SambaNova, etc.), exposed via the InferenceClient and OpenAI-compatible endpoints.(Hugging Face)

Key points:

  • HF Inference is the serverless Inference API that used to be called “Inference API (serverless)”. It’s now just one provider (“hf-inference”) behind the router.(Hugging Face)

  • Pricing model for HF Inference:

    • There is a free tier.
    • After that, you pay per request based on compute time × hardware price.(Hugging Face)
    • Docs explicitly describe this as a serverless online service, not as a batch job system.
  • Other Inference Providers (Together, SambaNova, etc.) are also billed pay-as-you-go. HF’s own forum answers state that “usage fees for each Inference Provider’s endpoint would apply directly.”(Hugging Face Forums)

Important for your question:

  • The Inference Providers docs and pricing pages do not define any “Batch API”, “batch jobs”, or special discounted batch pricing.
  • The InferenceClient reference provides methods like chat_completion, text_generation, feature_extraction (for embeddings), etc., but again no create_batch_job / get_batch_result API. (Hugging Face)

So:

For serverless Inference Providers, Hugging Face only exposes online (synchronous or streaming) APIs, not an asynchronous batch job API with separate discounted pricing.

There is no 1:1 equivalent of “OpenAI Batch API” or “Vertex Batch prediction” at the Provider-router level.


3. Options to reduce cost with Hugging Face

Even though there is no direct batch-discount API, you can still reduce effective cost per token with a combination of model choice, infrastructure choice, and usage patterns.

Think of three levels:

  1. Within serverless Providers: what you can do while staying purely “serverless”.
  2. Offline jobs on Hugging Face: HF Jobs for batch inference (closest conceptual match to Vertex batch).
  3. Self-host / third-party infra: vLLM / TEI / other offline inference, still using HF models.

I’ll walk through each, plus general token-saving techniques.


3.1 Inside serverless Providers: efficiency, not discounts

Within the serverless Providers themselves, you mainly have two levers:

3.1.1 Choose cheaper models and cheaper providers

HF Inference pricing is compute-based: hardware-time × price.(Hugging Face)

Implications:

  • Smaller / more efficient models (e.g. 1–8B LLMs, distilled models, specialized classifiers) will:

    • Run faster (less compute time).
    • Use cheaper underlying hardware (e.g. T4 vs A100).
    • So they cost less per token even without a batch discount.
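
As a rough illustration of "compute time × hardware price", the sketch below compares two requests. The function name and every number in it are hypothetical placeholders, not real Hugging Face prices; check the pricing page for actual figures.

```python
def request_cost_usd(compute_seconds: float, hardware_usd_per_hour: float) -> float:
    """Cost of one request billed as compute time x hardware price."""
    return compute_seconds * hardware_usd_per_hour / 3600.0


# Hypothetical numbers, for illustration only.
small = request_cost_usd(compute_seconds=0.5, hardware_usd_per_hour=0.60)
large = request_cost_usd(compute_seconds=4.0, hardware_usd_per_hour=4.00)

print(f"small model: ${small:.6f} per request")
print(f"large model: ${large:.6f} per request")
print(f"large / small: {large / small:.1f}x")
```

Both factors multiply, so a model that is both faster and runs on cheaper hardware can be much cheaper per request than either factor alone suggests.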

You can:

  • Use the model selector on huggingface.co and check which Providers are available and what their prices are.(Hugging Face)

  • In the router, you can choose:

    • A specific model, optionally with a routing suffix such as :fastest or :cheapest (or a provider name) where the router supports it.
    • A specific provider (e.g. "hf-inference") vs a more expensive one.

This is analogous to switching from GPT-4 Turbo to GPT-4o-mini: you’re changing the model as your main cost lever.

3.1.2 Client-side micro-batching and concurrency

This does not change the price per token, but it can:

  • Increase throughput.
  • Reduce wall-clock time.
  • Potentially lower effective cost if you’re billed in coarse blocks of compute time by the provider.

Two main approaches:

  1. Batch multiple inputs into a single call, if the task and endpoint support it (e.g. some classification or embedding endpoints accept lists in the request payload). The typed InferenceClient.text_classification helper is documented for a single string, so the portable pattern is one call per input.

    Example pattern:

    from huggingface_hub import InferenceClient
    
    # See docs: https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client  # noqa: E501
    
    client = InferenceClient(provider="hf-inference")  # or another Provider
    
    texts = [
        "I love this product!",
        "This is terrible...",
        "It's ok, not great.",
    ]
    
    # text_classification() takes one string and returns a list of
    # label/score elements for that input, so loop over the texts.
    for t in texts:
        preds = client.text_classification(
            t,
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        )
        top = max(preds, key=lambda p: p.score)  # highest-scoring label
        print(t, "→", top.label, top.score)
    

    Not every method supports list inputs; it depends on the task/model, so check the method signature before passing a list.

  2. Use AsyncInferenceClient to run many requests in parallel with a controlled concurrency limit:

    import asyncio
    from huggingface_hub import AsyncInferenceClient
    
    # See docs: https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client  # noqa: E501
    
    client = AsyncInferenceClient(provider="hf-inference")
    
    prompts = [
        {"role": "user", "content": "Explain batch inference in one sentence."},
        {"role": "user", "content": "List three ways to save tokens."},
        # ...
    ]
    
    async def run_one(msg):
        resp = await client.chat_completion(
            model="meta-llama/Llama-3.2-3B-Instruct",  # example small model; availability depends on the provider
            messages=[msg],
            max_tokens=256,
        )
        return resp.choices[0].message.content
    
    async def main():
        sem = asyncio.Semaphore(10)  # tune to respect rate limits
    
        async def guarded(m):
            async with sem:
                return await run_one(m)
    
        results = await asyncio.gather(*(guarded(p) for p in prompts))
        for p, r in zip(prompts, results):
            print(p["content"], "→", r[:80], "...")
    
    asyncio.run(main())
    

Again: this improves throughput and utilization, not pricing tiers. But if your load is modest and you just want to process offline datasets faster, this might already be “good enough” without leaving serverless.
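
If you go this route for a whole dataset, it helps to keep results in input order and to record failures instead of aborting the run. The following is a minimal sketch of that idea; run_batch and fake_model are hypothetical names, and fake_model is a stand-in for a real call such as AsyncInferenceClient.chat_completion so that the example runs without network access.

```python
import asyncio
import json


async def run_batch(prompts, call_model, max_concurrency=10):
    """Run call_model(prompt) for every prompt with bounded concurrency.

    Results come back in input order; a failed request is recorded as
    {"index": ..., "error": ...} instead of cancelling the whole batch.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(index, prompt):
        async with sem:
            try:
                return {"index": index, "output": await call_model(prompt)}
            except Exception as exc:
                return {"index": index, "error": repr(exc)}

    return await asyncio.gather(*(guarded(i, p) for i, p in enumerate(prompts)))


async def fake_model(prompt):
    # Stand-in for a real provider call; fails on empty prompts.
    await asyncio.sleep(0)
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


results = asyncio.run(run_batch(["hello", "", "world"], fake_model, max_concurrency=2))
for record in results:
    print(json.dumps(record))
```

Writing each record as one JSON line gives you roughly the same "upload prompts, download results" shape as a batch API, just without the discount.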


3.2 HF Jobs: closest analogue to Vertex batch prediction

If you want something that genuinely looks like “batch jobs”, you should look at Hugging Face Jobs, which is a separate feature from Providers.

From the official Jobs docs and CLI guide:(Hugging Face)

  • HF Jobs lets you run containerized code (e.g. Python scripts) on HF-managed CPUs/GPUs.
  • The docs explicitly list “Batch Inference: Run offline inference on thousands of samples using optimized GPU setups” as a primary use case.
  • Other use cases include training, data processing, and synthetic data generation.

Conceptually, HF Jobs behaves like:

“Docker run on Hugging Face GPUs” with job scheduling and logs.

How this differs from serverless Providers:

  • You are not billed per Provider request; you’re billed for compute instance time (e.g. a10g-small for N hours).
  • You pull models from the Hub (or your private repos) and run inference in your code.
  • You control batching (batch size, data loader, etc.), so you can approach the theoretical throughput of the hardware.

This is much closer to Vertex batch prediction than Providers are.

Very rough workflow:

  1. Put your batch inference script in a repo:

    # repo structure
    batch_infer/
      - run.py
      - requirements.txt
    
  2. In run.py, write your offline inference loop using transformers, vLLM, or TEI, reading from a dataset and writing out predictions.

  3. Launch a job:

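    # Assumes batch_infer/ is available inside the container (for example,
    # cloned or downloaded as the first step of the command); the stock
    # python:3.12 image does not contain it.
    hf jobs run \
      --flavor a10g-small \
      python:3.12 \
      bash -lc "pip install -r batch_infer/requirements.txt && python batch_infer/run.py"
    
    # docs: https://huggingface.co/docs/huggingface_hub/en/guides/jobs  # noqa: E501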
    

Advantages for “token cost”:

  • By running on your own job, you can:

    • Use large batches (e.g. 64–512 prompts per forward pass) with vLLM/TEI.
    • Achieve much higher throughput per GPU-hour than serverless, because you control batching and scheduling.
    • That translates into a lower effective cost per token (compute cost divided by total tokens processed), especially for large offline piles of data.

The HF Jobs docs and related blog posts show examples of this, including large-scale LLM batch inference with vLLM.(Hugging Face)
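
For the run.py side, the core of an offline script is usually just "chunk the prompts, call the backend once per chunk, collect the outputs". Here is a minimal sketch under that assumption. The names chunked, run_offline, and echo_backend are hypothetical, and echo_backend is a placeholder for whatever actually generates text (a vLLM or Transformers call, for example).

```python
import json
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_offline(
    prompts: List[str],
    generate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 64,
) -> List[Dict[str, str]]:
    """Call generate_batch once per chunk and pair each prompt with its output."""
    records = []
    for batch in chunked(prompts, batch_size):
        outputs = generate_batch(batch)
        if len(outputs) != len(batch):
            raise ValueError("backend returned a different number of outputs")
        for prompt, output in zip(batch, outputs):
            records.append({"prompt": prompt, "output": output})
    return records


def echo_backend(batch: List[str]) -> List[str]:
    # Placeholder backend; swap in a real model call here.
    return [f"echo: {p}" for p in batch]


records = run_offline([f"prompt {i}" for i in range(5)], echo_backend, batch_size=2)
for record in records:
    print(json.dumps(record))
```

Keeping the backend behind a single function makes it easy to test the script locally before paying for GPU time in a job.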


3.3 Self-hosted vLLM / TEI / local endpoints (still using HF models)

HF’s general inference guide says you can point InferenceClient to a local or remote server running vLLM, TGI, llama.cpp, etc., as long as it uses an OpenAI-compatible API.(Hugging Face)

Example pattern from the docs:

from huggingface_hub import InferenceClient

# Example server: vLLM / TGI / litellm proxy exposing OpenAI-style /v1/chat/completions
# See: https://huggingface.co/docs/huggingface_hub/en/guides/inference  # noqa: E501

client = InferenceClient(model="http://localhost:8080")

resp = client.chat_completion(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Short answer: why batch inference?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)

With something like vLLM:

  • The docs describe an offline inference mode where you create an LLM object, load an HF model, and pass lists of prompts, with strong throughput as batch size increases. (docs.vllm.ai)
  • Benchmarks show vLLM significantly outperforming plain Transformers for offline inference, especially for large batches.

You can deploy this:

  • On your own infra (Kubernetes, bare-metal, etc.).
  • On other clouds (GCP / AWS / Azure) using managed job systems (Cloud Run Jobs, SageMaker Processing, etc.).
  • Or inside HF Jobs, combining the job system with vLLM for maximum throughput.

Cost-wise:

  • You pay for compute (GPU/CPU) directly (cloud or HF Jobs).
  • If you keep the GPU highly utilized with large batches, your effective cost per token is often much lower than serverless.
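
The utilization point can be put into numbers. This is a back-of-the-envelope sketch only: the function name is hypothetical, and the GPU price and throughput figures are made-up placeholders rather than measurements.

```python
def usd_per_million_tokens(
    gpu_usd_per_hour: float,
    peak_tokens_per_second: float,
    utilization: float,
) -> float:
    """Effective cost per 1M tokens = hourly price / tokens actually produced per hour."""
    tokens_per_hour = peak_tokens_per_second * utilization * 3600.0
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical GPU: $1.00/hour, 1000 tokens/s when fully batched.
busy = usd_per_million_tokens(1.00, 1000.0, utilization=0.9)
idle = usd_per_million_tokens(1.00, 1000.0, utilization=0.1)

print(f"90% utilized: ${busy:.3f} per 1M tokens")
print(f"10% utilized: ${idle:.3f} per 1M tokens")
```

The hourly price is fixed, so cost per token scales inversely with utilization. That is why large, well-batched offline runs are where self-hosting tends to pay off.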

This is essentially “build your own Vertex Batch but using open models from Hugging Face and a general-purpose job system.”


3.4 General token-usage reduction strategies (work regardless of provider)

These are independent of whether you use serverless Providers, HF Jobs, or your own infra. They directly reduce tokens → reduce cost.

  1. Shorter prompts and contexts:

    • Remove redundant instructions; cache and re-use an initial system prompt where possible.
    • For retrieval-augmented setups, aggressively deduplicate and truncate context.
    • Use smaller max context LLMs if full long-context is not needed.
  2. Control max_new_tokens / max_tokens:

    • Set hard caps consistent with your task (e.g. 128–256 tokens for most summarization / classification tasks versus unlimited).
    • This alone can halve output token usage.
  3. Specialized small models instead of big general LLMs:

    • For classification, sentiment, extraction, etc., use smaller Transformers or specific finetunes (e.g. sentiment, NER, topic classifiers) from HF; these are cheaper and faster than large chat models.
  4. Embeddings:

    • Use smaller embedding models (e.g. TEI-hosted 256–768-dim embeddings) instead of huge general models.
    • TEI is designed for efficient, high-throughput embedding generation, often with better cost-per-vector for large batches.
  5. Caching and deduplication:

    • Store model outputs keyed by (model, prompt, settings).
    • For periodic jobs (e.g. daily summarization), only re-run on new content; reuse prior results.
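
A cache keyed by (model, prompt, settings) can be sketched in a few lines. This is an in-memory illustration with hypothetical names (cache_key, ResponseCache); a real pipeline would more likely persist the store to disk or a database.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, settings: dict) -> str:
    """Stable hash of (model, prompt, settings); dict key order does not matter."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "settings": settings},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache that only calls the model on a miss."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get_or_compute(self, model, prompt, settings, compute):
        key = cache_key(model, prompt, settings)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute(prompt)
        return self._store[key]


cache = ResponseCache()
settings = {"max_tokens": 128, "temperature": 0.0}
first = cache.get_or_compute("some-model", "Summarize X", settings, lambda p: f"summary of {p}")
second = cache.get_or_compute("some-model", "Summarize X", settings, lambda p: "never called")
print(first == second, cache.misses)
```

Including the settings in the key matters: the same prompt with a different max_tokens or temperature is a different request and should not reuse the old output.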

Even if HF does not give you a “batch token discount,” these steps can plausibly cut your total token usage by 2–5×, depending on the workload, which matches or exceeds the roughly 50% discount you get from Batch in other ecosystems.


4. Putting it together: mapping your current pattern to HF

Given how you’re already using OpenAI and Gemini batch, here’s a practical mapping to Hugging Face:

  1. If you want a pure serverless, “no infra” approach:

    • Accept that there is no discounted batch pricing.
    • Use smaller, cheaper models via Providers.
    • Use AsyncInferenceClient to run high-throughput offline jobs (e.g., reading prompts from a file and writing out results), but treat this as “my own batch script using a real-time API”, not a special Batch API.
  2. If you want a true batch-optimized workflow (closest to Vertex Batch):

    • Move the offline workloads to HF Jobs:

      • Run a Python script that fetches an open model from Hugging Face (or your private repo).
      • Use vLLM / TGI / Transformers with large batches.
      • Process your dataset offline inside the job.
    • Pay for GPU time instead of Provider tokens, and tune batch size and scheduling for maximum throughput.

  3. If you already have cloud infra (GCP / AWS / etc.):

    • You can emulate Vertex Batch by:

      • Running vLLM/TEI with HF models on your infra.
      • Using your cloud’s job system (Cloud Run Jobs, SageMaker Processing, etc.).
    • Then call these endpoints from your existing pipelines in almost exactly the same way as you do today with Vertex batch jobs.

In other words:

  • No direct Batch API / discount on HF serverless Providers.

  • Yes, you can absolutely get “batch-style” cost savings by:

    • using cheaper models/providers,
    • shifting large offline work to HF Jobs (or your own vLLM/TEI deployment) and
    • applying classic token-usage optimization techniques.

Thanks for pointing me in the right direction, @John6666! It seems that at least Novita does support a batch API, according to this recent blog post: Batch API: Reduce Bandwidth Waste and Improve API Efficiency
