How does the GPT-J inference API work?

Hi all,
I started a 7-day trial of the Startup plan. I need to use GPT-J through the HF Inference API. I pinned it on the org account to run on GPU, but after sending a request, all I get back is a single generated word, even though the max token parameter is set to 100.
Could you please let me know how I should make it generate more than one word 🙂

Hi,

Normally it should be available. I’ll ask the team and get back to you.


GPT-J is supported by the inference API. You can try it out here: EleutherAI/gpt-j-6B · Hugging Face

Can you verify whether you are able to generate more than a single word?

Thanks for your reply! I verified, and it still generates only one word for me. I’m sending the JSON below in my request to https://api-inference.huggingface.co/models/EleutherAI/gpt-j-6B

{
  "inputs": "Nowadays, many users would like to upgrade old hard drive to SSD with Windows installed, or reinstall Windows 10 on SSD afterward. The faster boot speed and reading & writing speed make it known as a better boot drive.\r\n\r\nTo be specifc, after installing the operating system on a new SSD, you'll find the computer boots up faster, and runs smoothly even with muliple programs in the background.\r\n\r\nSo let's",
  "parameters": {
    "return_full_text": false,
    "max_new_tokens": 100,
    "temperature": 0.8
  },
  "options": {
    "use_cache": false
  }
}

And I’m getting back this:

HTTP/1.1 200 OK
date: Wed, 29 Sep 2021 06:56:12 GMT,Wed, 29 Sep 2021 06:56:21 GMT
server: istio-envoy
x-compute-time: 0.1596
x-compute-type: gpu
access-control-expose-headers: x-compute-type, x-compute-time
x-compute-characters: 406
content-length: 27
content-type: application/json
x-envoy-upstream-service-time: 177

[{"generated_text": " get"}]

Is something wrong on my side?
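For reference, the request above can be reproduced with a short Python script using only the standard library. This is a minimal sketch: the endpoint URL and parameters come from this thread, while the API token and the truncated prompt are placeholders you would substitute with your own values.

```python
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-j-6B"
API_TOKEN = "hf_xxx"  # placeholder -- substitute your real token

def query(payload: dict) -> list:
    """POST a JSON payload to the Inference API and return the parsed reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Same shape as the payload shown above (prompt truncated here).
payload = {
    "inputs": "Nowadays, many users would like to upgrade old hard drive to SSD ...",
    "parameters": {
        "return_full_text": False,
        "max_new_tokens": 100,
        "temperature": 0.8,
    },
    "options": {"use_cache": False},
}

# result = query(payload)  # returns a list like [{"generated_text": "..."}]
```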

It’s strange: if I send only one sentence as input, it works:

So this is my request:
{
  "inputs": "Nowadays, many users would like to upgrade old hard drive to SSD with Windows installed, or reinstall Windows 10 on SSD afterward.",
  "parameters": {
    "return_full_text": false,
    "max_new_tokens": 100,
    "temperature": 0.8
  },
  "options": {
    "use_cache": false
  }
}

And I get back this:

HTTP/1.1 200 OK
date: Wed, 29 Sep 2021 07:06:20 GMT,Wed, 29 Sep 2021 07:06:29 GMT
server: istio-envoy
x-compute-time: 2.1995999999999998
x-compute-type: gpu
access-control-expose-headers: x-compute-type, x-compute-time
x-compute-characters: 130
content-length: 130
content-type: application/json
x-envoy-upstream-service-time: 2244

[{"generated_text": " In this way, they can enjoy the performance of SSD and the convenience of Windows. In the first place, the"}]

To me, it looks like there is some sort of limit on the input tokens.

Try using the "max_length" parameter instead of "max_new_tokens". The documentation suggests they serve the same purpose and that you should not use both simultaneously. That worked for me.
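If that fix applies to your case too, the adjusted parameters block would look like the sketch below (prompt truncated as a placeholder). One caveat worth hedging: in the underlying transformers generation API, max_length typically counts the prompt tokens plus the newly generated tokens, so with a long prompt you may want a value comfortably larger than the prompt length.

```python
# Hypothetical adjusted payload: max_length replaces max_new_tokens;
# the two should not be sent together.
payload = {
    "inputs": "Nowadays, many users would like to upgrade old hard drive to SSD ...",
    "parameters": {
        "return_full_text": False,
        "max_length": 100,  # counts prompt + generated tokens in transformers
        "temperature": 0.8,
    },
    "options": {"use_cache": False},
}
```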
