Inference service for large models, such as Vicuna 13b

Hi All!

I’d like API access to some of the new SOTA models, like Vicuna 13b.

I found jeffwan/vicuna-13b, and its model page says:

"Use this model with the Inference API"

So I copied over the sample code:

import requests

API_URL = "https://api-inference.huggingface.co/models/jeffwan/vicuna-13b"
# Replace xxx with your Hugging Face API token; note the space after "Bearer".
headers = {"Authorization": "Bearer xxx"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "Can you please let us know more details about your "})
print(output)

I tested with my API key, but got:

{'error': 'The model jeffwan/vicuna-13b is too large to be loaded automatically (26GB > 10GB). For commercial use please use PRO spaces (https://huggingface.co/spaces) or Inference Endpoints (https://huggingface.co/inference-endpoints).'}
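For anyone copying the snippet: the failure comes back inside the JSON body under an "error" key, so the function happily returns it as if it were model output. A minimal guard, reusing API_URL and headers from above (just a sketch; it assumes errors keep arriving in that shape):

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    data = response.json()
    # The Inference API can report failures (like the "too large" message
    # above) in an "error" field of the JSON body, so check for it
    # explicitly rather than treating it as model output.
    if isinstance(data, dict) and "error" in data:
        raise RuntimeError(f"Inference API error: {data['error']}")
    return data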

I looked at deploying an Inference Endpoint instead (e.g., on a T4 or A100).

It is far too expensive for an individual developer: ~$5k/month.

Does Hugging Face offer hosted inference for models such as vicuna-13b?

I was confused by the wording: the model page says "Use this model with the Inference API", but the response says the model "is too large to be loaded automatically", which suggests it does not actually work with the Inference API.
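As an aside, here is how one might check a repo's size up front against the 10GB auto-load cutoff. This is only a sketch; it assumes huggingface_hub's HfApi.model_info(..., files_metadata=True) populates per-file sizes, which I believe recent versions do:

from huggingface_hub import HfApi

api = HfApi()
# files_metadata=True asks the Hub to include per-file metadata,
# including sizes in bytes, for every file in the repo.
info = api.model_info("jeffwan/vicuna-13b", files_metadata=True)
total_gb = sum(f.size or 0 for f in info.siblings) / 1e9
print(f"Total repo size: ~{total_gb:.1f} GB (auto-load cutoff is 10 GB)")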

Thanks!