Memory use of GPT-J-6B

Hello everyone!

I am trying to run GPT-J-6B on a (more or less) powerful computer and I have run into some problems.

I have followed the documentation examples (GPT-J — transformers 4.11.0.dev0 documentation) and also this guide (Use GPT-J 6 Billion Parameters Model with Huggingface).

The following are the specifications of the available resources:

  • transformers version: 4.11.0.dev0
  • Platform: Linux-5.4.0-84-generic-x86_64-with-Ubuntu-18.04-bionic
  • Platform resources: 32GB RAM and 30GB Swap
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes, a GeForce RTX 2080 SUPER (7981MiB)
  • Using distributed or parallel set-up in script?: No
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0  On |                  N/A |
|  0%   43C    P8    11W / 250W |    342MiB /  7981MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      1324      G   /usr/bin/gnome-shell               70MiB |
|    0   N/A  N/A      2673      G   /usr/lib/xorg/Xorg                175MiB |
|    0   N/A  N/A      2808      G   /usr/bin/gnome-shell               34MiB |
|    0   N/A  N/A      7608      G   /usr/lib/firefox/firefox           10MiB |
|    0   N/A  N/A      7782      G   ...AAAAAAAAA= --shared-files       26MiB |
+-----------------------------------------------------------------------------+

I’ll start with what works for me: I’ve loaded the model into the machine’s RAM (no GPU, CPU only). Loading consumes all 32 GB of RAM plus 17 GB of swap and takes about 500 seconds (~8 min); afterwards the consumption drops to 24 GB of RAM and 14 GB of swap. Sending an input and generating an output takes about 2 minutes on average before the user gets a response.

First question: Is this memory consumption normal for this model? Do these times look reasonable for this amount of RAM and swap?

Code:

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (full fp32 precision, CPU only) and time it
start_time = time.time()
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
end_time = time.time() - start_time
print("Total Taken => ", end_time)

prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
         "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
         "researchers was the fact that the unicorns spoke perfect English."

# Tokenize the prompt and time the generation
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
start_time = time.time()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
end_time = time.time() - start_time
print("Total Taken => ", end_time)

Seeing that the model was too much for the machine, I decided to lower the precision with torch_dtype=torch.float16 and load it on the GPU. But after a few minutes, having consumed 32 GB of RAM and 12 GB of swap, the following code raises this exception:

Code:

import time
from transformers import GPTJForCausalLM, AutoTokenizer
import torch

start_time = time.time()
model =  GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
end_time = time.time() - start_time
print("Total Taken => ",end_time)
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
         "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
          "researchers was the fact that the unicorns spoke perfect English."

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
start_time = time.time()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
end_time = time.time() - start_time
print("Total Taken => ",end_time)

Output:

Traceback (most recent call last):
  File "gpt.py", line 202, in <module>
    model =  GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to("cuda")
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 852, in to
    return self._apply(convert)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in _apply
    param_applied = fn(param)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 850, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.79 GiB total capacity; 6.07 GiB already allocated; 38.56 MiB free; 6.07 GiB reserved in total by PyTorch)

Second question: Is this exception due to running out of memory on the GPU? How much VRAM does GPT-J-6B need to fit on the GPU?

Seeing that this was not working either, I decided to skip the GPU and use only the CPU with float16 precision. But then another exception arises:

Code:

import time
from transformers import GPTJForCausalLM, AutoTokenizer
import torch

start_time = time.time()
model =  GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
end_time = time.time() - start_time
print("Total Taken => ",end_time)
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
         "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
          "researchers was the fact that the unicorns spoke perfect English."

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
start_time = time.time()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
end_time = time.time() - start_time
print("Total Taken => ",end_time)

Output:

Total Taken =>  177.10330414772034
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Traceback (most recent call last):
  File "gpt.py", line 128, in <module>
    gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/generation_utils.py", line 1026, in generate
    **model_kwargs,
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/generation_utils.py", line 1533, in sample
    output_hidden_states=output_hidden_states,
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 780, in forward
    return_dict=return_dict,
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 631, in forward
    output_attentions=output_attentions,
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 274, in forward
    hidden_states = self.ln_1(hidden_states)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/normalization.py", line 174, in forward
    input, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/functional.py", line 2346, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

I think this last message is a bug, but I would like to know your opinion, in case I have programmed something wrong… It looks like there is an incompatibility between one of the layers and float16 precision, so the model cannot be used with this configuration.

Third question: what are the hardware requirements to run this model? The documentation says that with a GPU it would only consume 24 GB of RAM, but it does not mention how much memory the graphics card itself needs to reach that figure. Likewise, loading each of these models has used around 50 GB of memory (RAM plus swap). Is there no way to reduce the RAM usage when loading the model?
Will there always be those ~50 GB peaks while loading, dropping to 24 GB afterwards?

Thanks for your time reading me ^.^

You need at least 12 GB of GPU RAM to put the model on the GPU, and your GPU has less memory than that, so you won’t be able to use it on the GPU of this machine. You can’t use it in half precision on CPU because not all of the model’s layers are implemented for half precision (the layer norm, for example), so you need to use the model in full precision on the CPU to make predictions (which will take a looooooooong time).
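
For rough intuition on that 12 GB figure: GPT-J-6B has on the order of 6 billion parameters and half precision uses 2 bytes per parameter, so the weights alone are already around 11–12 GiB before activations and the generation cache. A minimal back-of-the-envelope sketch, assuming the nominal 6B parameter count:

# Rough lower bound on VRAM: fp16 weights only, ignoring activations and the KV cache
n_params = 6_000_000_000        # nominal parameter count of GPT-J-6B
bytes_per_param = 2             # half precision (fp16) = 2 bytes per parameter
print(f"~{n_params * bytes_per_param / 1024**3:.1f} GiB for the weights alone")  # ≈ 11.2 GiB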

As for the RAM footprint, we are working on a way for from_pretrained to load the model using only the model’s size in RAM (currently it consumes twice the model size). It should be merged soon.

Is the model available through the Inference API? I tried to use it with a startup plan, but it only returns the first word.

Hi @sgugger, any updates on the possibility of reducing the RAM footprint of this model?
We are eager to test it out for our application, but our 11 GB of GPU RAM doesn’t seem to be enough to run the float16 model.
I cannot find a way to split the model across 2 separate GPU devices either (that would also be a solution for many of us), so I am hoping your solution lets us move forward!

I never said we had a solution to load the model with less than 12GB of GPU RAM. That is just not possible at the moment. We have merged the PR that adds a low_memory argument to from_pretrained to be able to load the model with 12GB of CPU RAM.

Obviously such a big model requires specialized hardware.

How do we use the low_memory option to load the model?

You can specify the low_cpu_mem_usage=True argument to the from_pretrained method.
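
For example, a minimal sketch of loading GPT-J this way (this only lowers peak CPU RAM while loading; it does not change how much GPU memory the model needs):

from transformers import GPTJForCausalLM, AutoTokenizer
import torch

# low_cpu_mem_usage=True avoids materializing a second full copy of the weights,
# so peak CPU RAM stays close to the size of the checkpoint itself.
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")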

I’m using an Nvidia RTX 3090 that has 24GB of dedicated VRAM. For some reason, I am still running out of memory… am I doing something wrong?

from transformers import GPTJForCausalLM, AutoTokenizer
import torch    

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16, low_cpu_mem_usage=True).to("cuda")

Traceback (most recent call last):
  File "C:\Users\wolfg\code\aibrush-2\worker\gptjtest.py", line 5, in <module>
    model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16, low_cpu_mem_usage=True).to("cuda")
  File "C:\Users\wolfg\anaconda3\envs\vqgan\lib\site-packages\torch\nn\modules\module.py", line 852, in to
    return self._apply(convert)
  File "C:\Users\wolfg\anaconda3\envs\vqgan\lib\site-packages\torch\nn\modules\module.py", line 530, in _apply
    module._apply(fn)
  File "C:\Users\wolfg\anaconda3\envs\vqgan\lib\site-packages\torch\nn\modules\module.py", line 552, in _apply
    param_applied = fn(param)
  File "C:\Users\wolfg\anaconda3\envs\vqgan\lib\site-packages\torch\nn\modules\module.py", line 850, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 788.00 MiB (GPU 0; 24.00 GiB total capacity; 21.88 GiB already allocated; 399.81 MiB free; 21.89 GiB reserved in total by PyTorch)

Solved my issue. Needed this:

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16, low_cpu_mem_usage=True).to("cuda", torch.float16)

Also need to move the input ids to the GPU:

input_ids = tokenizer(context, return_tensors="pt").input_ids.to("cuda")
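
Putting the pieces from this thread together, a minimal end-to-end sketch for a GPU with enough VRAM (at least ~12 GB for the fp16 weights, plus headroom for activations; the prompt below is just an illustration):

from transformers import GPTJForCausalLM, AutoTokenizer
import torch

# Load in half precision with low peak CPU RAM, then move the weights to the GPU
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to("cuda", torch.float16)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# The input ids must live on the same device as the model
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote valley."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
print(tokenizer.batch_decode(gen_tokens)[0])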

Getting the same issue, but using CPU mode only (not calling cuda() at all). And it only happens for GPT-J, not GPT-Neo. Has anyone been able to run GPT-J on CPU only, or is this a bug? (It doesn’t matter to me if it’s slow; I only need to run it for some calibration purposes, not in real time.)

Any luck solving this issue?

I’m facing the same problem; I tried with low_cpu_mem_usage and without.

Man, you saved my day. By the way, my solution was model = model.to(torch.float16).

Does this also work when using pipelines?