Cannot run large models using API token

Hello,

I am having the following two issues.

  1. I cannot run large models using the Inference API. For example, if I run the following:
import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neox-20b"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

def query(payload):
	# POST the payload to the hosted Inference API and return the parsed JSON
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

output = query({
	"inputs": "Can you please let us know more details about your ",
})
print(output)

I get this error

{'error': 'Model EleutherAI/gpt-neox-20b is currently loading', 'estimated_time': 1651.7474365234375}

Why does this happen, and is there a way around the issue?
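The obvious workaround seems to be to poll until the model has loaded, along the lines of the sketch below (it assumes the error body always carries the estimated_time field shown above; the docs also mention an options field with wait_for_model, which I have not tried). But with an estimated time of roughly 1650 seconds that means waiting nearly half an hour:

import time
import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neox-20b"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

def query_with_retry(payload, max_wait_s=1800):
	# Keep retrying while the API reports that the model is still loading
	waited = 0
	while True:
		response = requests.post(API_URL, headers=headers, json=payload)
		result = response.json()
		if isinstance(result, dict) and "estimated_time" in result:
			if waited >= max_wait_s:
				raise TimeoutError(f"Model still loading after {waited}s: {result}")
			sleep_s = min(30, result["estimated_time"])
			time.sleep(sleep_s)
			waited += sleep_s
			continue
		return result

output = query_with_retry({"inputs": "Can you please let us know more details about your "})
print(output)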

  2. Even for the smaller models that I do manage to run successfully, the output is different from the one generated in the user interface. For example, the code below
import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neo-2.7B"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

output = query({
	"inputs": "Can you please let us know more details about your ",
})
print(output)

generated the following

[{'generated_text': 'Can you please let us know more details about your  \nschedule.\n\nThanks,   \nLiz Taylor  \n\n-----Original Message----- \nFrom: Dasovich, Jeff [mailto:Jeff.D'}]

but the output shown on the website is different.

Why is this the case? Is there a way to ensure the outputs using the free Inference API are more aligned with those of the web UI?
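My guess is that the web widget applies sampling settings that a bare API call does not, so presumably the payload needs explicit generation parameters. A rough sketch of what I mean (the parameter names below are taken from the text-generation task documentation as I understand it, so treat them as assumptions rather than something I have verified):

import requests

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neo-2.7B"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

payload = {
	"inputs": "Can you please let us know more details about your ",
	# Explicit generation settings; without these the API falls back to its
	# defaults, which need not match whatever the web widget uses
	"parameters": {
		"do_sample": True,       # sample instead of greedy decoding
		"temperature": 0.9,
		"top_p": 0.95,
		"max_new_tokens": 50,
		"return_full_text": False,
	},
	# Ask for a fresh generation instead of a cached result
	"options": {"use_cache": False},
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())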

Thanks

Hi @AndreaSottana, that is a very large model and it takes a long time to load on our Inference API. Our Inference API is suitable for testing and evaluation; if you're looking for lower latency, you probably need our dedicated service, Inference Endpoints.

You can read more about how the Hub Inference API works here.
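Once an endpoint is deployed, calling it looks much like your snippet above, just pointed at your endpoint URL instead of the shared API. A minimal sketch (the URL below is only a placeholder for the one shown on your endpoint page):

import requests

# Placeholder: use the URL shown on your endpoint's page in Inference Endpoints
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer <MY_API_KEY_HERE>"}

response = requests.post(
	ENDPOINT_URL,
	headers=headers,
	json={"inputs": "Can you please let us know more details about your "},
)
print(response.json())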

Hi @radames

Many thanks for letting me know. I have two follow-up questions:

  1. Is there a way to load one of your large models locally for inference using model parallelism, without having to manually edit the model’s internal code? I have access to multiple GPUs. A single 24GB GPU cannot fit a 20-billion-parameter model in memory, but across multiple GPUs it should be possible; however, when I load the model it only tries to place it on a single GPU and then gives an out-of-memory error. (A sketch of what I mean is below, after this list.)

  2. Why are the results non-deterministic when I run a model through the UI on the hub, whereas when I download the same model and run inference locally it appears to be deterministic and gives a different output from the one I get on the hub? I am asking mostly about generative models.
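For point 1, this is roughly what I have in mind, assuming accelerate is installed so that device_map="auto" can shard the checkpoint across the visible GPUs. It is only a sketch, not something I have confirmed works for this particular model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" (via accelerate) should spread the weights over all
# visible GPUs instead of trying to fit the whole model on one device.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,  # halves the memory footprint vs. float32
)

prompt = "Can you please let us know more details about your "
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))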

Thanks a lot


I am also unable to run a large model on the Inference API, specifically Salesforce/codegen-16B-mono. Neither the widget on the website nor a REST request through Python works; in both cases I get a timeout. For example, after some time the widget shows: Model Salesforce/codegen-16B-mono time out.

Is that because the model is too big, or because something in the backend is broken for that model? If it's the latter, should I ask the model’s authors for help?

Hi,

This makes sense; however, these two models:
zephyr
aisak-assistant
are exactly the same size, but the former is able to run on the Inference API, whereas the latter cannot. Could you please provide a solution or an explanation?

Thanks

Hi @mandelakori, Zephyr is an LLM developed by our team, for which we’ve manually enabled inference. For other large models, we currently recommend using Inference Endpoints.