Cannot run large models using API token

Hello,

I am having the following two issues.

  1. I cannot run large models using the Inference API. For example, if I run the following:
import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neox-20b"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

def query(payload):
	# POST the payload to the hosted Inference API and return the parsed JSON
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

output = query({
	"inputs": "Can you please let us know more details about your ",
})
print(output)

I get this error

{'error': 'Model EleutherAI/gpt-neox-20b is currently loading', 'estimated_time': 1651.7474365234375}

Why does this happen, and is there a way around the issue?
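The obvious workaround seems to be to poll until the model has loaded, along the lines of the sketch below (it assumes the error body always carries the estimated_time field shown above; the docs also mention an options field with wait_for_model, which I have not tried). But with an estimated time of roughly 1650 seconds that means waiting nearly half an hour:

import time
import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neox-20b"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

def query_with_retry(payload, max_wait_s=1800):
	# Keep retrying while the API reports that the model is still loading
	waited = 0
	while True:
		response = requests.post(API_URL, headers=headers, json=payload)
		result = response.json()
		if isinstance(result, dict) and "estimated_time" in result:
			if waited >= max_wait_s:
				raise TimeoutError(f"Model still loading after {waited}s: {result}")
			sleep_s = min(30, result["estimated_time"])
			time.sleep(sleep_s)
			waited += sleep_s
			continue
		return result

output = query_with_retry({"inputs": "Can you please let us know more details about your "})
print(output)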

  2. Even for the smaller models that I do manage to run successfully, the output is different from the one generated in the user interface. For example, the code below
import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neo-2.7B"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

output = query({
	"inputs": "Can you please let us know more details about your ",
})
print(output)

generated the following

[{'generated_text': 'Can you please let us know more details about your  \nschedule.\n\nThanks,   \nLiz Taylor  \n\n-----Original Message----- \nFrom: Dasovich, Jeff [mailto:Jeff.D'}]

but the output shown on the website is different.

Why is this the case? Is there a way to ensure the outputs using the free Inference API are more aligned with those of the web UI?
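My guess is that the web widget applies sampling settings that a bare API call does not, so presumably the payload needs explicit generation parameters. A rough sketch of what I mean (the parameter names below are taken from the text-generation task documentation as I understand it, so treat them as assumptions rather than something I have verified):

import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neo-2.7B"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

payload = {
	"inputs": "Can you please let us know more details about your ",
	# Explicit generation settings; without these the API falls back to its
	# defaults, which need not match whatever the web widget uses
	"parameters": {
		"do_sample": True,       # sample instead of greedy decoding
		"temperature": 0.9,
		"top_p": 0.95,
		"max_new_tokens": 50,
		"return_full_text": False,
	},
	# Ask for a fresh generation instead of a cached result
	"options": {"use_cache": False},
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())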

Thanks

Hi @AndreaSottana, that is a very large model and it takes a long time to load on our Inference API. Our Inference API is suitable for testing and evaluation; if you're looking for lower latency, you probably need our dedicated service, Inference Endpoints.

You can read more about how the Hub Inference API works here.
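Once an endpoint is deployed, calling it looks much like your snippet above, just pointed at your endpoint URL instead of the shared API. A minimal sketch (the URL below is only a placeholder for the one shown on your endpoint page):

import requests

# Placeholder: use the URL shown on your endpoint's page in Inference Endpoints
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

response = requests.post(
	ENDPOINT_URL,
	headers=headers,
	json={"inputs": "Can you please let us know more details about your "},
)
print(response.json())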

Hi @radames

Many thanks for letting me know. I have two follow-up questions:

  1. Is there a way to load one of your large models locally for inference using model parallelism, without having to manually edit the model’s internal code? I have access to multiple GPUs. A single 24GB GPU cannot fit a 20-billion-parameter model in memory, but across multiple GPUs it should be possible; however, when I load the model it only tries to place it on a single GPU and then gives an out-of-memory error. (A sketch of what I mean is below, after this list.)

  2. Why are the results non-deterministic when I run a model through the UI on the hub, whereas when I download the same model and run inference locally it appears to be deterministic and gives a different output from the one I get on the hub? I am asking mostly about generative models.
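For point 1, this is roughly what I have in mind, assuming accelerate is installed so that device_map="auto" can shard the checkpoint across the visible GPUs. It is only a sketch, not something I have confirmed works for this particular model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" (via accelerate) should spread the weights over all
# visible GPUs instead of trying to fit the whole model on one device.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,  # halves the memory footprint vs. float32
)

prompt = "Can you please let us know more details about your "
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))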

Thanks a lot


I am also unable to run a large model on the Inference API, specifically Salesforce/codegen-16B-mono. Neither the widget on the website nor a REST request through Python works; in both cases I get a timeout. For example, after some time the widget shows: Model Salesforce/codegen-16B-mono time out.

Is that because the model is too big, or because something in the backend is broken for that model? If it's the latter, should I ask the model’s authors for help?

Hi,

This makes sense; however, these two models:
zephyr
aisak-assistant
are exactly the same size, but the former is able to run on the Inference API, whereas the latter cannot. Could you please provide a solution or an explanation?

Thanks

Hi @mandelakori, Zephyr is an LLM developed by our team, for which we’ve manually enabled inference. For other large models, we currently recommend using Inference Endpoints.