Hi,
I just copied and pasted the following code into a Google Colab notebook with my API token in order to check the inference time of t5-base from the HF model hub.
Note: code inspiration from
import json
import requests

API_TOKEN = 'xxxxxxx'  # my HF API token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/t5-base"

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8")), response.headers.get('x-compute-type')
And then, I run the following code in another cell of the same notebook:
%%time
data, x_compute_type = query(
    {
        "inputs": "Translate English to German: My name is Wolfgang.",
    }
)
print('data: ', data)
print('x_compute_type: ', x_compute_type)
I got the following output:
data:  [{'translation_text': 'Übersetzen Sie meinen Namen Wolfgang.'}]
x_compute_type: cpu
CPU times: user 17.3 ms, sys: 871 µs, total: 18.1 ms
Wall time: 668 ms
When I run this cell a second time, I get the following output, which comes from the cache:
data:  [{'translation_text': 'Übersetzen Sie meinen Namen Wolfgang.'}]
x_compute_type: cache
CPU times: user 16.7 ms, sys: 0 ns, total: 16.7 ms
Wall time: 180 ms
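(As an aside, to make repeated timings reflect actual compute rather than the cache, the Inference API documentation describes an "options" field in the payload; the snippet below is a minimal sketch assuming that field is honored for this model:)

%%time
# Minimal sketch: "use_cache": False asks the API to skip its result cache,
# so a repeated call should report cpu (or cpu+optimized) instead of cache.
data, x_compute_type = query(
    {
        "inputs": "Translate English to German: My name is Wolfgang.",
        "options": {"use_cache": False},
    }
)
print('x_compute_type: ', x_compute_type)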
2 remarks:
- the x_compute_type is cpu, not cpu+optimized (see the doc “Using CPU-Accelerated Inference (~10x speedup)”). This is confirmed by the inference time of about 700 ms, which is roughly what I get when I run model.generate() for T5 in a Google Colab notebook without the Inference API (see the local timing sketch after this list); with the Accelerated Inference API it should be around 70 ms, no?
- even the cached inference time (nearly 200 ms) is not really low, even though it is almost 4 times shorter than the initial one.
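For reference, this is the kind of local baseline I mean; a rough sketch, assuming transformers, torch and sentencepiece are installed in the Colab runtime (the ~700 ms figure above comes from a measurement like this):

import time
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load t5-base locally on CPU (same model as the Inference API endpoint above).
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("translate English to German: My name is Wolfgang.", return_tensors="pt")

start = time.time()
outputs = model.generate(**inputs)  # plain CPU generation, no optimization
elapsed = time.time() - start

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"local model.generate() time: {elapsed:.3f} s")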
How can I get the Accelerated Inference API working for a T5 model? Thanks.
cc @jeffboudier