Using gpt-j-6B in a CPU Space without the Inference API

I wanted to try out Spaces with Gradio to host a gpt-j-6B model with a slightly modified GPTJLMHeadModel. Because of the modification, I need to load the model with .from_pretrained() and can’t use the Inference API or load it via Gradio’s


After trying to get the model to run in a Space, I am no longer sure whether it is possible at all to host a downloaded gpt-j-6B model on Hugging Face Spaces with the free plan, and I would like to ask whether that conclusion is correct.
The steps that led me to it are described below.

My first problem was that, after the model had downloaded, the app’s status was “running” but I only saw the message “Error occured while trying to proxy:”,
which did not appear when I used a gpt2 model, for example. After a few minutes the page showed the error message:
“CPU Memory limit exceeded (16GB)”.
Since this seemed to be related to the 16GB RAM limit, I tried the first tip from here (GPT-J)
and used the smaller checkpoint with float16 precision.
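As a rough sanity check on the memory error, the weights alone can be estimated from the parameter count. The sketch below assumes roughly 6 billion parameters (an approximation, not an exact figure) and counts only the weights, ignoring activations and any temporary copies made while loading.

```python
# Back-of-the-envelope estimate of the RAM needed just to hold the weights.
# N_PARAMS is an approximation for GPT-J-6B, not an exact count.
N_PARAMS = 6_000_000_000
BYTES_PER_PARAM = {"float32": 4, "float16": 2}
RAM_LIMIT_GB = 16  # limit quoted in the Space's error message


def weights_size_gb(n_params, dtype):
    """Size of the weights in GB (10**9 bytes) for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    size = weights_size_gb(N_PARAMS, dtype)
    verdict = "below" if size < RAM_LIMIT_GB else "above"
    print(f"{dtype}: ~{size:.0f} GB of weights, {verdict} the {RAM_LIMIT_GB} GB limit")
```

Under these assumptions the float32 weights come to about 24 GB and the float16 weights to about 12 GB, which is why the float16 checkpoint looked like the obvious thing to try.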
When I loaded the model like this:

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)

I got this error when trying to generate something:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/gradio/", line 199, in predict
    prediction, durations = await run_in_threadpool(
  File "/home/user/.local/lib/python3.8/site-packages/starlette/", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/user/.local/lib/python3.8/site-packages/anyio/", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/", line 818, in run_sync_in_worker_thread
    return await future
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/", line 754, in run
    result = context.run(func, *args)
  File "/home/user/.local/lib/python3.8/site-packages/gradio/", line 530, in process
    predictions, durations = self.run_prediction(
  File "/home/user/.local/lib/python3.8/site-packages/gradio/", line 487, in run_prediction
    prediction = predict_fn(*processed_input)
  File "", line 11, in gptj6
    output = model.generate(input_ids=input_ids)
  File "/home/user/.local/lib/python3.8/site-packages/torch/autograd/", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/transformers/", line 989, in generate
    return self.greedy_search(
  File "/home/user/.local/lib/python3.8/site-packages/transformers/", line 1291, in greedy_search
    outputs = self(
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/transformers/models/gptj/", line 774, in forward
    transformer_outputs = self.transformer(
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/transformers/models/gptj/", line 630, in forward
    outputs = block(
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/transformers/models/gptj/", line 274, in forward
    hidden_states = self.ln_1(hidden_states)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/", line 189, in forward
    return F.layer_norm(
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/", line 2347, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

This is probably related to the comment in “Memory use of GPT-J-6B - #2 by sgugger”, which says that half precision is not usable on CPU.
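Whether the layer norm kernel exists for a given dtype on CPU depends on the installed PyTorch build, so it can be probed directly on a tiny tensor before loading a large checkpoint. This is only a minimal sketch: the helper name `layer_norm_supported` is made up for illustration, and a passing probe does not guarantee that every other operation in the model supports that dtype on CPU.

```python
import torch
import torch.nn.functional as F


def layer_norm_supported(dtype, device="cpu"):
    """Return True if F.layer_norm runs for `dtype` on `device` in this PyTorch build."""
    x = torch.ones(2, 8, dtype=dtype, device=device)
    weight = torch.ones(8, dtype=dtype, device=device)
    bias = torch.zeros(8, dtype=dtype, device=device)
    try:
        F.layer_norm(x, (8,), weight, bias)
        return True
    except RuntimeError:
        # e.g. '"LayerNormKernelImpl" not implemented for 'Half''
        return False


for dtype in (torch.float32, torch.float16):
    print(dtype, "layer_norm on CPU:", layer_norm_supported(dtype))
```

If the float16 probe fails, the model has to run in a dtype the CPU kernels do support, such as float32, which brings the memory problem back.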

I then tried to load the model using only the argument low_cpu_mem_usage=True:

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", low_cpu_mem_usage=True)

With this, the app did start, although it took more than 40 minutes, during which I again saw only the message “Error occured while trying to proxy:”.
When I then entered the input “Hi”, the model generated for ~52 seconds until an error symbol appeared in the Gradio app.
I then found the enable_queue=True setting via another forum post (which I can’t link since I am a new user :sweat_smile:); it prevents this timeout.
With it, the model generated for roughly 900 seconds until I got another Gradio error with no message, which I assume is another timeout.
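Because the Gradio error carries no message, it may help to log how long the prediction function actually runs and whether it raises, so that a timeout can be told apart from a failure inside generation. The decorator below is a minimal stdlib-only sketch; `timed` and the `echo` stand-in are hypothetical names, not part of Gradio or transformers.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("space")


def timed(fn):
    """Log the wall-clock duration of fn and any exception it raises."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            logger.exception("%s failed", fn.__name__)
            raise
        finally:
            logger.info("%s took %.1f s", fn.__name__, time.perf_counter() - start)
    return wrapper


@timed
def echo(text):
    # Hypothetical stand-in for the real prediction function.
    return text.upper()


print(echo("hi"))
```

Wrapping the real prediction function the same way would put the duration and any traceback into the Space's logs, even when the UI only shows an error symbol.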

The code I used to test this is below (it worked with gpt2 in Spaces and with gptj on a local GPU):

import gradio as gr
import torch
from transformers import AutoTokenizer, GPTJForCausalLM, GPT2LMHeadModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_dir = "EleutherAI/gpt-j-6B"  # or "gpt2" for a quick test
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = GPT2LMHeadModel.from_pretrained(model_dir)
# Alternative tried above: revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
model = GPTJForCausalLM.from_pretrained(model_dir, low_cpu_mem_usage=True)
model.to(device)  # keep the model on the same device as the inputs

def gptj6(text):
    # Tokenize the prompt into a (1, seq_len) tensor on the model's device.
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    output = model.generate(input_ids=input_ids)
    # Only one sequence is generated, so decode the first (and only) one.
    return tokenizer.decode(output[0], clean_up_tokenization_spaces=True, skip_special_tokens=True)

# enable_queue=True avoids the request timeout on long generations.
iface = gr.Interface(fn=gptj6, inputs="text", outputs="text")
iface.launch(enable_queue=True)

