Just to confirm what I wrote in the first post of this thread, I did the same tests with InferenceApi from huggingface_hub.inference_api.
Indeed, the huggingface_hub library has a client wrapper to access the Inference API programmatically (doc: “How to programmatically access the Inference API”).
Therefore, I ran the following code in a Google Colab notebook:
!pip install huggingface_hub
from huggingface_hub.inference_api import InferenceApi
API_TOKEN = 'xxxxxxx' # my HF API token
model_name = "t5-base"
inference = InferenceApi(repo_id=model_name, token=API_TOKEN)
print(inference)
I got the following output:
InferenceApi(options='{'wait_for_model': True, 'use_gpu': False}', headers='{'Authorization': 'xxxxxx'}', task='translation', api_url='https://api-inference.huggingface.co/pipeline/translation/t5-base')
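For reference, here is a minimal sketch of the raw HTTP request that this wrapper presumably sends, based on the api_url, headers and options shown in the output above (the library's exact internals may differ):

import requests

API_TOKEN = 'xxxxxxx'  # my HF API token
api_url = "https://api-inference.huggingface.co/pipeline/translation/t5-base"
headers = {"Authorization": f"Bearer {API_TOKEN}"}  # Bearer token header (value redacted above)
payload = {
    "inputs": "Translate English to German: My name is Claude.",
    "options": {"wait_for_model": True, "use_gpu": False},  # same options as reported above
}
response = requests.post(api_url, headers=headers, json=payload)
print(response.json())  # expected shape: [{'translation_text': '...'}]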
Then, I ran the following code:
%%time
inputs = "Translate English to German: My name is Claude."
output = inference(inputs=inputs)
print(output)
And I got the following output:
[{'translation_text': 'Mein Name ist Claude.'}]
CPU times: user 14 ms, sys: 1.05 ms, total: 15.1 ms
Wall time: 651 ms
When I ran the same code a second time, I got the cached output:
[{'translation_text': 'Mein Name ist Claude.'}]
CPU times: user 14.3 ms, sys: 581 µs, total: 14.9 ms
Wall time: 133 ms
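As a side note, if one wants to time the non-cached path repeatedly, a possible workaround (an assumption on my side, based on the options documented for the Inference API) is to disable the cache in the request options, for example with a raw request:

import requests

API_TOKEN = 'xxxxxxx'  # my HF API token
api_url = "https://api-inference.huggingface.co/pipeline/translation/t5-base"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
payload = {
    "inputs": "Translate English to German: My name is Claude.",
    # use_cache=False should force a fresh computation instead of returning the cached result
    "options": {"wait_for_model": True, "use_cache": False},
}
print(requests.post(api_url, headers=headers, json=payload).json())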
We can observe that the inference times (initial and cached) correspond to those published in my first post (I guess this is expected, since the underlying code is the same). However, we end up with the same question: how can I get the Accelerated Inference API for a T5 model?
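As far as I can tell, the InferenceApi constructor also accepts a gpu argument (which is what fills the use_gpu option shown above), so a sketch of the call would be:

from huggingface_hub.inference_api import InferenceApi

API_TOKEN = 'xxxxxxx'  # my HF API token
# gpu=True should set use_gpu=True in the request options; whether it is honoured
# presumably depends on the plan attached to the token, which is exactly my question
inference = InferenceApi(repo_id="t5-base", token=API_TOKEN, gpu=True)
print(inference(inputs="Translate English to German: My name is Claude."))

But whether this actually enables accelerated inference for t5-base is precisely what I am trying to find out.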