Hello everyone!
I am trying to run GPT-J-6B on a reasonably powerful computer and I have run into some problems.
I have followed the documentation examples (GPT-J — transformers 4.11.0.dev0 documentation) and also this guide (Use GPT-J 6 Billion Parameters Model with Huggingface).
The following are the specifications of the available resources:
- transformers version: 4.11.0.dev0
- Platform: Linux-5.4.0-84-generic-x86_64-with-Ubuntu-18.04-bionic
- Platform resources: 32GB RAM and 30GB Swap
- Python version: 3.6.9
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes, a GeForce RTX 2080 SUPER (7981MiB)
- Using distributed or parallel set-up in script?: No
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:02:00.0 On | N/A |
| 0% 43C P8 11W / 250W | 342MiB / 7981MiB | 20% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1186 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 1324 G /usr/bin/gnome-shell 70MiB |
| 0 N/A N/A 2673 G /usr/lib/xorg/Xorg 175MiB |
| 0 N/A N/A 2808 G /usr/bin/gnome-shell 34MiB |
| 0 N/A N/A 7608 G /usr/lib/firefox/firefox 10MiB |
| 0 N/A N/A 7782 G ...AAAAAAAAA= --shared-files 26MiB |
+-----------------------------------------------------------------------------+
I’ll start by explaining what works for me: I loaded the model into the machine’s RAM (CPU only, no GPU). Loading takes about 500 seconds (8 min) and uses all 32 GB of RAM plus 17 GB of swap; afterwards, consumption drops to 24 GB of RAM and 14 GB of swap. Generating an output for a given input takes 2 minutes on average.
First question: Is the observed memory consumption normal for this model? Are these times reasonable for this amount of RAM and swap?
Code:
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
start_time = time.time()
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
end_time = time.time() - start_time
print("Total Taken => ",end_time)
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
"previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
"researchers was the fact that the unicorns spoke perfect English."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
start_time = time.time()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
end_time = time.time() - start_time
print("Total Taken => ",end_time)
Seeing that the model was too much for the machine, I decided to lower the precision by setting torch_dtype to float16 and to load the model on the GPU. However, after a few minutes, and after consuming 32 GB of RAM and 12 GB of swap, the following code raises the exception shown below:
Code:
import time
from transformers import GPTJForCausalLM, AutoTokenizer
import torch
start_time = time.time()
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
end_time = time.time() - start_time
print("Total Taken => ",end_time)
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
"previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
"researchers was the fact that the unicorns spoke perfect English."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")  # inputs must be on the same device as the model
start_time = time.time()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
end_time = time.time() - start_time
print("Total Taken => ",end_time)
Output:
Traceback (most recent call last):
File "gpt.py", line 202, in <module>
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to("cuda")
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 852, in to
return self._apply(convert)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in _apply
param_applied = fn(param)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 850, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 7.79 GiB total capacity; 6.07 GiB already allocated; 38.56 MiB free; 6.07 GiB reserved in total by PyTorch)
Second question: Is this exception caused by the GPU running out of memory? How much VRAM does GPT-J-6B need in order to fit on the GPU?
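As a rough sanity check on the out-of-memory error, the size of the weights alone can be estimated from the parameter count. The sketch below assumes a nominal 6 billion parameters (an approximation based on the model name) and ignores activations, the CUDA context and any other overhead. `weights_gib` and the constants are illustrative names, not part of any library.

```python
# Rough estimate of the memory needed just to hold the model weights.
# ASSUMPTION: ~6e9 parameters (nominal size suggested by the name GPT-J-6B).
# Activations, the CUDA context and other overhead are NOT included.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}
GIB = 1024 ** 3


def weights_gib(n_params: float, dtype: str) -> float:
    """Return the size of the raw weights in GiB for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / GIB


N_PARAMS = 6e9        # assumed nominal parameter count
GPU_TOTAL_GIB = 7.79  # total capacity reported in the CUDA error message

for dtype in ("float32", "float16"):
    size = weights_gib(N_PARAMS, dtype)
    verdict = "fits" if size < GPU_TOTAL_GIB else "does not fit"
    print(f"{dtype}: ~{size:.1f} GiB of weights, {verdict} in {GPU_TOTAL_GIB} GiB")
```

Under that assumption, the float16 weights alone come to roughly 11 GiB, which is more than the 7.79 GiB the card reports, so an out-of-memory error during `.to("cuda")` would be expected.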
Seeing that this did not work either, I decided to drop the GPU and use only the CPU with float16 precision. But then another exception arises:
Code:
import time
from transformers import GPTJForCausalLM, AutoTokenizer
import torch
start_time = time.time()
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
end_time = time.time() - start_time
print("Total Taken => ",end_time)
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
"previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
"researchers was the fact that the unicorns spoke perfect English."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
start_time = time.time()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
end_time = time.time() - start_time
print("Total Taken => ",end_time)
Output:
Total Taken => 177.10330414772034
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Traceback (most recent call last):
File "gpt.py", line 128, in <module>
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/generation_utils.py", line 1026, in generate
**model_kwargs,
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/generation_utils.py", line 1533, in sample
output_hidden_states=output_hidden_states,
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 780, in forward
return_dict=return_dict,
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 631, in forward
output_attentions=output_attentions,
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/transformers/models/gptj/modeling_gptj.py", line 274, in forward
hidden_states = self.ln_1(hidden_states)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/modules/normalization.py", line 174, in forward
input, self.normalized_shape, self.weight, self.bias, self.eps)
File "/home/robotica/Escritorio/gpt/env/lib/python3.6/site-packages/torch/nn/functional.py", line 2346, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
I think this last error is a bug, but I would like your opinion in case I have programmed something wrong. It looks like an incompatibility between one of the layers (the LayerNorm in the traceback) and float16 precision on the CPU, so the model cannot be used with this configuration.
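One way to sidestep this error, assuming the cause is that this PyTorch build has no half-precision LayerNorm kernel for the CPU, would be to use float16 only when the model actually runs on a CUDA device and fall back to float32 on the CPU. A minimal sketch follows; `dtype_name_for` is a hypothetical helper, not a transformers or PyTorch function.

```python
def dtype_name_for(device: str) -> str:
    """Pick a weight dtype name based on where the model will run.

    float16 is only chosen for CUDA devices. On the CPU we stay with float32,
    because the traceback shows a CPU op that is not implemented for 'Half'.
    """
    return "float16" if device.startswith("cuda") else "float32"


# Intended use with the script above (left as comments so the sketch stays cheap to run):
#   import torch
#   dtype = getattr(torch, dtype_name_for("cpu"))
#   model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=dtype)
print(dtype_name_for("cpu"), dtype_name_for("cuda:0"))
```

The trade-off is that float32 on the CPU doubles the memory needed for the weights compared with float16.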
Third question: what are the hardware requirements to run this model? The documentation explains that with a GPU it would only consume 24 GB of RAM, but it does not say how much GPU memory is needed to reach that figure. Likewise, loading the model in each of these attempts has used around 50 GB of memory (RAM plus swap). Is there no way to reduce RAM use when loading the model?
Will there always be these peaks of around 50 GB while the model loads, before consumption drops to 24 GB?
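For reference, the GPT-J documentation describes a lower-memory loading path that combines the `float16` revision of the checkpoint with `torch_dtype=torch.float16` and `low_cpu_mem_usage=True`. The sketch below wraps that call in a hypothetical helper, `load_gptj_fp16` (the function name and its `device` argument are illustrative), and has not been verified on the machine described here.

```python
def load_gptj_fp16(model_id="EleutherAI/gpt-j-6B", device=None):
    """Load GPT-J with half-precision weights while trying to keep peak RAM low.

    Sketch only: assumes a transformers version that supports
    `low_cpu_mem_usage` and that the repo has a "float16" revision.
    """
    # Heavy imports live inside the function so that defining it stays cheap.
    import torch
    from transformers import AutoTokenizer, GPTJForCausalLM

    model = GPTJForCausalLM.from_pretrained(
        model_id,
        revision="float16",         # branch that stores the weights in fp16
        torch_dtype=torch.float16,  # keep fp16 instead of upcasting to fp32
        low_cpu_mem_usage=True,     # avoid holding two full copies in RAM while loading
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if device is not None:
        model = model.to(device)    # e.g. "cuda", if the card has enough memory
    return model, tokenizer


# Example call (commented out because it downloads and loads the full model):
#   model, tokenizer = load_gptj_fp16(device="cuda")
```

Even with this path, float16 weights for roughly 6 billion parameters take about 12 GB (2 bytes each), so this reduces the load-time RAM peak but does not by itself make the model fit on an 8 GB card.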
Thanks for taking the time to read this ^.^