Hi All!
I’d like API access to some of the new SOTA models, like Vicuna 13b.
I found jeffwan/vicuna-13b, and the model page says "Use this model with the Inference API".
I copy the sample code over (token redacted):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/jeffwan/vicuna-13b"
# Note: "Bearer" and the token are separated by a space, not an underscore
headers = {"Authorization": "Bearer xxx"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "Can you please let us know more details about your "})
print(output)
```
I test with my API key, but see:
```
{'error': 'The model jeffwan/vicuna-13b is too large to be loaded automatically (26GB > 10GB). For commercial use please use PRO spaces (https://huggingface.co/spaces) or Inference Endpoints (https://huggingface.co/inference-endpoints).'}
```
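To rule out an auth problem, the same call can be pointed at a small hosted model. A minimal sanity-check sketch, assuming gpt2 is still served on the free Inference API:

```python
import requests

# Sanity check: identical request against a model small enough for the
# free tier (assumes gpt2 is still hosted on the free Inference API).
API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer xxx"}  # same token as above

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Can you please let us know more details about your "},
)
# If generated text comes back here, the token and call format are fine,
# and the failure above is purely about model size.
print(response.json())
```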
I looked at deploying an Inference Endpoint instead (e.g., on a T4 or A100), but it is far too expensive for an individual developer: roughly $5k/month.
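For context on where a number like that comes from, here is a back-of-the-envelope sketch; the hourly rate is an assumed illustrative figure, not HF's actual price list:

```python
# Rough monthly cost for an always-on GPU endpoint.
# The $6.50/hour A100 rate is an assumption for illustration --
# check the current Inference Endpoints pricing page for real numbers.
hourly_rate_usd = 6.50
hours_per_month = 24 * 30
print(f"~${hourly_rate_usd * hours_per_month:,.0f}/month")  # ~$4,680/month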
Does HuggingFace host models such as vicuna-13b?
I was confused by the language "Use this model with the Inference API" on the model page, followed by "The model jeffwan/vicuna-13b is too large to be loaded automatically" at query time, which indicates that the model does not actually work with the Inference API.
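In case it helps anyone hitting the same wall, the size cutoff can be checked up front. A minimal sketch using huggingface_hub (the ~10GB limit is taken from the error message above):

```python
from huggingface_hub import HfApi

# Estimate a repo's total file size before calling the free Inference API,
# which (per the error above) refuses models larger than ~10GB.
api = HfApi()
info = api.model_info("jeffwan/vicuna-13b", files_metadata=True)
total_gb = sum(f.size or 0 for f in info.siblings) / 1e9
print(f"{total_gb:.1f} GB")  # ~26 GB, matching the error message
```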
Thanks!