Hello,
I wanted to try out spaces with Gradio, to host a gpt-j-6B model with a slightly modified GPTJLMHeadModel. Therefore, I need to use .from_pretrained() to load the model and can’t use the inference API or load it via Gradio’s
gr.Interface.load("huggingface/EleutherAI/gpt-j-6B").
After trying to get the model to run in a Space, I am currently not sure whether it is possible at all to host a downloaded gpt-j-6B model on Hugging Face Spaces (on the free plan), and I want to ask whether that is correct.
I have described below how I came to this conclusion.
I first had the problem that, after downloading the model, the app’s status was “running” but I only saw the message “Error occured while trying to proxy: hf.space/”,
which did not appear when I used a gpt2 model, for example. After a few minutes the page showed the error message:
“CPU Memory limit exceeded (16GB)”.
Since this seemed to be related to the 16GB RAM limit, I tried out the first tip from here: GPT-J,
and used a smaller version with float16 precision.
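As a rough back-of-envelope check of why the full-precision model cannot fit (assuming roughly 6 billion parameters for the weights alone, ignoring activations and framework overhead):

    # Rough memory estimate for the raw weights only; ~6e9 parameters is an assumption.
    n_params = 6.05e9
    print(f"float32: {n_params * 4 / 1e9:.1f} GB")  # ~24 GB -> over the 16 GB limit
    print(f"float16: {n_params * 2 / 1e9:.1f} GB")  # ~12 GB -> would fit, but see the CPU issue below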
When I loaded the model like this:
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
I got this error when trying to generate something:
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/gradio/app.py", line 199, in predict
prediction, durations = await run_in_threadpool(
File "/home/user/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/home/user/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
return await future
File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
result = context.run(func, *args)
File "/home/user/.local/lib/python3.8/site-packages/gradio/interface.py", line 530, in process
predictions, durations = self.run_prediction(
File "/home/user/.local/lib/python3.8/site-packages/gradio/interface.py", line 487, in run_prediction
prediction = predict_fn(*processed_input)
File "HF-Spaces_Test.py", line 11, in gptj6
output = model.generate(input_ids=input_ids)
File "/home/user/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/transformers/generation_utils.py", line 989, in generate
return self.greedy_search(
File "/home/user/.local/lib/python3.8/site-packages/transformers/generation_utils.py", line 1291, in greedy_search
outputs = self(
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 774, in forward
transformer_outputs = self.transformer(
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 630, in forward
outputs = block(
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 274, in forward
hidden_states = self.ln_1(hidden_states)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2347, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
Which is probably related to this comment here: Memory use of GPT-J-6B - #2 by sgugger, about half-precision not being usable on CPU.
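If that is indeed the cause, a possible workaround (just a sketch, not tested on Spaces) would be to load the float16 checkpoint to keep the download small and then cast the weights back to float32 before running on CPU; the upcast brings the model back to roughly 24 GB in RAM, though, so it would probably still exceed the 16 GB limit:

    import torch
    from transformers import GPTJForCausalLM

    # Load the smaller float16 checkpoint, then upcast to float32 so that CPU
    # kernels such as LayerNorm work. Note: after .float() the weights take
    # roughly 24 GB again, so this likely still exceeds the 16 GB Space limit.
    model = GPTJForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B",
        revision="float16",
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )
    model = model.float()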
I then tried to load the model using only the argument low_cpu_mem_usage=True:
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", low_cpu_mem_usage=True)
With this the app was running, although it took a while (>40 minutes), during which I again only saw the message “Error occured while trying to proxy: hf.space/”.
When I then entered the input “Hi”, the model generated for ~52 seconds until I got an error symbol in the Gradio app.
I then found the enable_queue=True setting via another forum post (which I can’t link since I am a new user), which prevents this timeout.
The model then generated for roughly 900 seconds until I got another Gradio error with no message, which I assume is another timeout.
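One way to at least rule out a timeout (a minimal sketch; max_new_tokens requires a reasonably recent transformers version, otherwise max_length behaves similarly, and the value 10 is only illustrative) would be to explicitly cap how many tokens are generated per request:

    # Cap the generation length so a single CPU request finishes quickly;
    # 10 new tokens is an arbitrary value chosen only for illustration.
    output = model.generate(input_ids=input_ids, max_new_tokens=10)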
I appended the code that I used to test this below (it worked with gpt2 in spaces and with gptj on a local GPU):
import gradio as gr
import torch
from transformers import AutoTokenizer, GPTJForCausalLM, GPT2LMHeadModel

def gptj6(text):
    input_ids = sum([tokenizer.encode(text)], [])
    input_ids = [input_ids]
    input_ids = torch.tensor(input_ids, device=device)  # , dtype=torch.long)
    output = model.generate(input_ids=input_ids)
    for generated_sequence in output:
        generated_sequence = generated_sequence.tolist()
        output = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True, skip_special_tokens=True)
    return output

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_dir = "EleutherAI/gpt-j-6B"  # "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = GPT2LMHeadModel.from_pretrained(model_dir)
model = GPTJForCausalLM.from_pretrained(model_dir, low_cpu_mem_usage=True)  # revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to(device).eval()

iface = gr.Interface(fn=gptj6, inputs="text", outputs="text").launch(enable_queue=True)
Best,
Be-Lo