Accelerated Inference API Automatic Speech Recognition

Hi, I’m trying to use the Automatic Speech Recognition API, but the docs are … light.

When I copy/paste the example code from the docs (below for convenience):

import json

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}  # API_TOKEN must hold your Hugging Face API token
API_URL = ""

def query(filename):
    # Read the raw audio bytes and POST them as the request body
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query("sample1.flac")

… I get …

{
  "error": "Model facebook/wav2vec2-base-960h is currently loading",
  "estimated_time": 20
}
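A loading error like the one above can at least be detected in code rather than eyeballed. This is a minimal sketch, assuming the body is JSON shaped like the error shown and that "estimated_time" is in seconds (the docs don't state the unit, so seconds is an assumption). The function name loading_wait_seconds and the 10-second fallback are hypothetical choices, not part of the API.

```python
import json


def loading_wait_seconds(body: bytes):
    """Return the suggested wait if the API says the model is still loading, else None.

    Hypothetical helper: 'body' is the raw response content (response.content).
    """
    payload = json.loads(body.decode("utf-8"))
    # Treat only a dict whose "error" text mentions loading as "model still loading".
    if isinstance(payload, dict) and "currently loading" in str(payload.get("error", "")):
        # ASSUMPTION: "estimated_time" is in seconds; fall back to 10 s if it is absent.
        return float(payload.get("estimated_time", 10))
    return None
```

Anything other than a loading error (including a successful transcription) returns None, so the caller can simply sleep for the returned number of seconds and retry when it gets a number back.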

The docs then say “no other parameters are currently allowed”. Does this mean I can’t ask it to use a GPU, for instance?


  1. It’d be nice if the docs had sample code that worked out of the box. Developer UX is important.
  2. It’d be nice if the docs also documented the response format. For instance, even in the error result ("estimated_time": 20), is that minutes, days, centuries, or nanoseconds?
  3. A very common ASR feature is word-by-word timestamps for alignment use cases. Does this API support that, or in any way harmonise the ASR engines underneath (SpeechBrain and another one)?

I’m poised to shell out some big bucks for GPU-level support at HF, but I need to see much more pro-level docs in this area.

@boxabirds I got the same problem. Looks like no one responded to your post. Did you work out the error?

The 503 error happens while the model is loading. The 20 is, I believe, seconds, but I find the estimate to be generally inaccurate. I have a while loop with some logging that’s essentially:

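response = requests.request("POST", API_URL, headers=headers, data=data)
while response.status_code == 503:
    # Model still loading: retry, asking the API to wait for the model
    response = requests.request("POST", API_URL, headers=headers, params={"wait_for_model": True}, data=data)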
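The retry loop above keeps going forever if the model never finishes loading. One way to bound it is sketched below. post_until_loaded is a hypothetical helper: it takes any zero-argument callable returning an object with a status_code attribute (for example a requests.Response), retries while it sees HTTP 503, logs each retry, and gives up after max_wait_s. The 300-second default is an assumption sized to the roughly 3-minute cold start reported in this thread; the 10-second poll interval is arbitrary.

```python
import logging
import time

log = logging.getLogger("asr")


def post_until_loaded(send, max_wait_s=300.0, poll_s=10.0, sleep=time.sleep):
    """Call send() until it returns a non-503 response, or raise TimeoutError.

    Hypothetical helper. send: zero-argument callable returning an object with
    .status_code (e.g. a requests.Response). sleep is injectable for testing.
    """
    waited = 0.0
    response = send()
    while response.status_code == 503:
        # ASSUMPTION: 503 means "model still loading", as described in this thread.
        if waited >= max_wait_s:
            raise TimeoutError(f"model still loading after {waited:.0f} s")
        log.info("model loading (503), retrying in %.0f s", poll_s)
        sleep(poll_s)
        waited += poll_s
        response = send()
    return response
```

With the names from the docs example, a call might look like post_until_loaded(lambda: requests.post(API_URL, headers=headers, data=data)), followed by .json() on the returned response. Passing send as a callable keeps the retry logic independent of how the request itself is built.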

For the ASR model I was using, it took almost 3 minutes from cold start to first response. After that, everything goes smoothly.