How to get Accelerated Inference API for T5 models?

Hi,

I just copy/pasted the following code into a Google Colab notebook with my API token in order to check the inference time of t5-base from the HF model hub.

Note: code inspiration from

import json
import requests

API_TOKEN = 'xxxxxxx' # my HF API token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/t5-base"

def query(payload):
    # POST the JSON payload and return the parsed response together with the
    # x-compute-type header (e.g. cpu, cpu+optimized, or cache).
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    return response.json(), response.headers.get("x-compute-type")

And then, I run the following code in another cell of the same notebook:

%%time
data, x_compute_type = query(
    {
        "inputs": "Translate English to German: My name is Wolfgang.",
    }
)

print('data: ',data)
print('x_compute_type: ',x_compute_type)

I got the following output:

data: [{'translation_text': 'Übersetzen Sie meinen Namen Wolfgang.'}]
x_compute_type: cpu
CPU times: user 17.3 ms, sys: 871 µs, total: 18.1 ms
Wall time: 668 ms

When I run this cell a second time, I get the following output, which comes from the cache:

data: [{'translation_text': 'Übersetzen Sie meinen Namen Wolfgang.'}]
x_compute_type: cache
CPU times: user 16.7 ms, sys: 0 ns, total: 16.7 ms
Wall time: 180 ms
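A cache hit like the one above hides the real inference latency. According to the Inference API docs, the request payload accepts an `options` object with a `use_cache` flag, so repeated calls can measure actual compute time. A minimal sketch, reusing the token, URL, and headers assumed in the code above (the network call is left commented out since it needs a valid token):

```python
import json
import requests

API_TOKEN = "xxxxxxx"  # your HF API token
API_URL = "https://api-inference.huggingface.co/models/t5-base"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def build_payload(text, use_cache=False):
    # "options" asks the API to skip its result cache, so a second
    # identical request still exercises the model.
    return {"inputs": text, "options": {"use_cache": use_cache}}

def query(payload):
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    return response.json(), response.headers.get("x-compute-type")

# data, ctype = query(build_payload("Translate English to German: My name is Wolfgang."))
# print(data, ctype)  # ctype should no longer be "cache" on repeat calls
```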

3 remarks:

  1. the x_compute_type is cpu, not cpu+optimized (see the doc section “Using CPU-Accelerated Inference (~10x speedup)”)
  2. this is confirmed by the inference time of about 700 ms, which matches what I get when I run model.generate() for T5 in a Google Colab notebook without the Inference API; with the Accelerated Inference API it should be around 70 ms, no?
  3. even the cached inference time (nearly 200 ms) is not really low, even though it is almost 4 times less than the initial one.
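To make the local-baseline comparison in remark 2 reproducible, a small timing helper can wrap any call with `time.perf_counter`. This is a sketch: the commented-out part assumes `transformers` is installed and downloads t5-base, so only the helper itself is run here.

```python
import time

def time_call(fn):
    """Run fn once and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Local CPU baseline for the ~700 ms figure (requires `transformers`):
# from transformers import T5ForConditionalGeneration, T5Tokenizer
# tokenizer = T5Tokenizer.from_pretrained("t5-base")
# model = T5ForConditionalGeneration.from_pretrained("t5-base")
# batch = tokenizer("Translate English to German: My name is Wolfgang.",
#                   return_tensors="pt")
# out, seconds = time_call(lambda: model.generate(**batch))
# print(tokenizer.decode(out[0], skip_special_tokens=True), f"{seconds:.3f}s")
```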

How can I get Accelerated Inference API for a T5 model? Thanks.
cc @jeffboudier


Just to confirm what I wrote in the first post of this thread, I ran the same tests with InferenceApi from huggingface_hub.inference_api.

Indeed, the huggingface_hub library has a client wrapper to access the Inference API programmatically (doc: “How to programmatically access the Inference API”).

Therefore, I ran the following code in a Google Colab notebook:

!pip install huggingface_hub
from huggingface_hub.inference_api import InferenceApi

API_TOKEN = 'xxxxxxx' # my HF API token
model_name = "t5-base"

inference = InferenceApi(repo_id=model_name, token=API_TOKEN)
print(inference)

I got as output:

InferenceApi(options='{'wait_for_model': True, 'use_gpu': False}', headers='{'Authorization': 'xxxxxx'}', task='translation', api_url='https://api-inference.huggingface.co/pipeline/translation/t5-base')

Then, I ran the following code:

%%time
inputs = "Translate English to German: My name is Claude."
output = inference(inputs=inputs)
print(output)

And I got as output:

[{'translation_text': 'Mein Name ist Claude.'}]
CPU times: user 14 ms, sys: 1.05 ms, total: 15.1 ms
Wall time: 651 ms

When I ran the same code a second time, I got the cached output:

[{'translation_text': 'Mein Name ist Claude.'}]
CPU times: user 14.3 ms, sys: 581 µs, total: 14.9 ms
Wall time: 133 ms

We can see that the inference times (initial and cached) match those in my first post (which makes sense, since the underlying code is the same). However, the question remains: how can I get the Accelerated Inference API for a T5 model?
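One caveat with `%%time` on a single request: network jitter and cache hits can dominate a one-shot measurement. A hedged sketch of averaging latency over several runs (the commented-out line assumes the `inference` client created earlier in this post):

```python
import statistics
import time

def measure_latency(call, runs=5):
    """Average wall-clock latency of `call` over several runs, so one
    cold (or cached) request does not skew the number."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), min(times), max(times)

# mean_s, best_s, worst_s = measure_latency(
#     lambda: inference(inputs="Translate English to German: My name is Claude."))
# print(f"mean {mean_s:.3f}s, min {best_s:.3f}s, max {worst_s:.3f}s")
```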


Hi @pierreguillou ,

Do you have a customer plan? Optimizations are not enabled for non-customers, which is why you are not seeing the expected header.

Also, keep in mind, as mentioned in the docs, that for customers we are usually able to go beyond the default depending on the load and requirements.

Cheers.

Hello @Narsil. If I understand your answer correctly, both CPU and GPU Accelerated Inference API are for paid plans (what you call a “customer plan”, am I right?), which would be the Pro Plan, Lab and Enterprise tiers on the HF pricing page.

Contributor plan | Try Accelerated Inference API: CPU, no?

However, this is not what is written on the HF pricing page. As you can see in the screenshot below, even a Contributor plan (I have one at pierreguillou (Pierre Guillou)) can “Try Accelerated Inference API: CPU”.

Since my first test used the T5 base model, which is not optimized even in AWS SageMaker (see this post from @philschmid), I ran another test with distilbert-base-uncased-distilled-squad. And, as with the T5 base model, this distilbert one is not CPU-accelerated through the Inference API.
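For anyone reproducing the distilbert test: distilbert-base-uncased-distilled-squad is a question-answering model, so the Inference API payload takes a dict of question and context rather than a plain string. A minimal sketch (token is a placeholder; the network call is commented out):

```python
import json
import requests

API_TOKEN = "xxxxxxx"  # your HF API token
API_URL = ("https://api-inference.huggingface.co/models/"
           "distilbert-base-uncased-distilled-squad")
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def build_qa_payload(question, context):
    # Question-answering models expect a dict input, not a plain string.
    return {"inputs": {"question": question, "context": context}}

# response = requests.post(
#     API_URL, headers=headers,
#     data=json.dumps(build_qa_payload(
#         "Where does Wolfgang live?", "My name is Wolfgang and I live in Berlin.")))
# print(response.json(), response.headers.get("x-compute-type"))
```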

You can check my Colab notebook HF_Inference_API.ipynb

About the expression “customer plan” or “API customer”

I think the HF team should edit the paragraph “Using CPU-Accelerated Inference (~10x speedup)” with a clear definition (see screenshot below) and verify the HF pricing page, too.

Conclusion | Contributors on the HF model hub cannot test the CPU-Accelerated Inference API :frowning:

But after saying all that, the reality is that we (the model contributors on the HF model hub) cannot test the CPU-Accelerated Inference API. What a pity!

Note: I did not understand your last comment (quoted below).

Also keep in mind as mentioned in the docs, that for customers we’re usually able to go beyond the default depending on the load and requirements.


Hi @pierreguillou ,

Thanks for the detailed explanation. That’s indeed a miscommunication on our part. The acceleration is indeed only available for customers (starting with the PRO plan).
We’re going to remove the “Accelerated” keyword, as it’s definitely misleading right now.

As an estimate, though, you should expect about a 20% speed increase out of the box for T5.

What I meant by saying we’re usually able to go beyond the default is that acceleration is complex.
The default acceleration should work great out of the box, but more acceleration is usually possible given your constraints: we are able to adapt hardware and software to maximize efficiency.
T5 is a great example, as it can do translation, summarization, and other tasks. If you are using it to summarize very long texts, or to translate short sentences, we can optimize it differently and get different performance boosts. So the default cpu+optimized you see might not reflect the ultimate performance we can reach if you want to use the API in your products.

If you are interested in using the API at scale, we usually recommend starting a discussion at api-enterprise@huggingface.co, where we can ask a few questions about the intended use (model + typical usage) and, from there, optimize a model specifically for your usage.

Does that answer your question better?


Thank you so much @Narsil for your detailed response!
I will contact you by email.