Running inference on OPT 30B on GPU

Thanks for the great work in adding the metaseq OPT models to transformers.
I am trying to run generation using the Hugging Face checkpoint for OPT 30B, but I get a CUDA out-of-memory error.
FYI: I am able to run inference for the 6.7B model on the same system.
My config: Azure compute node with 8 GPUs
Virtual machine size: Standard_ND40rs_v2 (40 cores, 672 GB RAM, 2900 GB disk)

Code
`from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch
import torch.nn as nn
import os
torch.cuda.empty_cache()

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

model = nn.DataParallel(AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16).cuda())

# the fast tokenizer currently does not work correctly
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)

prompt = "India is and country in South East Asia and is known for"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

set_seed(32)
generated_ids = model.module.generate(input_ids, do_sample=True, max_length=512)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))`

I see the error below:
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 680, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 31.75 GiB total capacity; 30.18 GiB already allocated; 9.75 MiB free; 30.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

nvidia-smi
`+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2…    On   | 00000001:00:00.0 Off |                    0 |
| N/A   36C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2…    On   | 00000002:00:00.0 Off |                    0 |
| N/A   36C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2…    On   | 00000003:00:00.0 Off |                    0 |
| N/A   34C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2…    On   | 00000004:00:00.0 Off |                    0 |
| N/A   36C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2…    On   | 00000005:00:00.0 Off |                    0 |
| N/A   35C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2…    On   | 00000006:00:00.0 Off |                    0 |
| N/A   38C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2…    On   | 00000007:00:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2…    On   | 00000008:00:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+`

I am not sure whether all GPUs are being used or whether the model needs to fit on a single GPU. Any leads would be great!

@patrickvonplaten, if you have any idea about this, it would be super helpful!

Hey @Radz,

We don’t recommend using nn.DataParallel and PyTorch doesn’t either anymore afaik: DataParallel — PyTorch 1.11.0 documentation
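For context on the out-of-memory error: nn.DataParallel keeps a full copy of the model on each GPU, and the `.cuda()` call first moves the whole model onto GPU 0, so the model does have to fit on a single card. A rough back-of-the-envelope sketch is below. It counts weights only (no activations or generation cache), and the parameter counts of 30e9 and 6.7e9 are approximations taken from the model names, not exact figures.

```python
# Rough weight-only memory estimate; parameter counts are approximations.
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB (fp16 = 2 bytes/param)."""
    return num_params * bytes_per_param / GIB


opt_30b = weight_memory_gib(30e9)    # OPT 30B in float16
opt_6_7b = weight_memory_gib(6.7e9)  # OPT 6.7B in float16
per_gpu = 32510 / 1024               # MiB per V100 from nvidia-smi, in GiB
num_gpus = 8

print(f"OPT 30B fp16 weights:  ~{opt_30b:.1f} GiB")
print(f"OPT 6.7B fp16 weights: ~{opt_6_7b:.1f} GiB")
print(f"One GPU: {per_gpu:.2f} GiB, all {num_gpus} GPUs: {per_gpu * num_gpus:.0f} GiB")
print("30B fits on one GPU:", opt_30b < per_gpu)
print("6.7B fits on one GPU:", opt_6_7b < per_gpu)
print("30B fits if sharded across all GPUs:", opt_30b < per_gpu * num_gpus)
```

This is consistent with what was reported: the 6.7B checkpoint fits on one 32 GB V100, while the 30B weights alone are larger than a single card but well within the combined memory of all eight.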

@sgugger is working on a feature that would make your use case very easy! See: Use Accelerate in `from_pretrained` for big model inference by sgugger · Pull Request #17341 · huggingface/transformers · GitHub
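A minimal sketch of what that could look like, assuming a transformers release in which this work is available as the `device_map` argument of `from_pretrained` and with `accelerate` installed. The helper names `build_max_memory` and `generate_sharded` and the 28 GiB per-GPU cap are hypothetical choices for illustration (the cap just leaves some headroom on a 32 GB card); this has not been run against the setup in this thread.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int) -> dict:
    """Per-device memory caps, e.g. {0: "28GiB", 1: "28GiB"} (hypothetical helper)."""
    return {gpu: f"{per_gpu_gib}GiB" for gpu in range(num_gpus)}


def generate_sharded(prompt: str, model_name: str = "facebook/opt-30b") -> list:
    # Imported lazily so the helper above can be used without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

    # device_map="auto" spreads the layers across the visible GPUs instead of
    # replicating the full model on each one, so no nn.DataParallel or .cuda().
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        max_memory=build_max_memory(num_gpus=8, per_gpu_gib=28),  # assumed cap
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

    # Assumes the embedding layer ends up on GPU 0, so the inputs go there too.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(0)

    set_seed(32)
    generated_ids = model.generate(input_ids, do_sample=True, max_length=512)
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)


# Usage (downloads the full checkpoint, so it is left commented out here):
# print(generate_sharded("Hello, my name is"))
```

The difference from the original snippet is that the model is split layer by layer over the GPUs rather than copied to each of them, so no single card has to hold all the weights.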