Hi, I am hitting an out-of-memory error on an H100 when running inference with CodeLlama 34b/70b (it sometimes works when generating a single output sequence). With the 13b models I also couldn't generate more than 5 completion sequences. My script and package versions are below.
accelerate==0.28.0
tokenizers==0.13.3
torch==2.2.1 ( torchaudio-2.2.1|py310_cu121)
transformers==4.33.0
triton==2.2.0
evaluate==0.4.1
Also tried with torch built for CUDA 11.8:
torch==2.2.1 (pytorch-2.2.1|py3.10_cuda11.8_cudnn8.7.0_0)
Nvidia Driver and CUDA (tried both CUDA 11.8 and 12.1)
NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf", cache_dir=self.cache_dir)
model = AutoModelForCausalLM.from_pretrained(self.model_name, cache_dir=self.cache_dir).to(self.device)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:3")
try:
    samples = model.generate(
        input_ids,
        temperature=0.9,
        top_k=40,
        top_p=0.5,
        repetition_penalty=1.1,
        max_new_tokens=1000,
        min_new_tokens=3,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        num_return_sequences=20)
except ...
    pass
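For context on the `num_return_sequences=20` setting: `generate` expands the prompt into a batch of that many sequences, so the key/value cache grows with the number of sequences. The sketch below is only a back-of-the-envelope estimate. The layer and head counts are my assumptions about the 34b configuration (48 layers, 8 key/value heads, head dimension 128), not values read from the script above, and the prompt length is ignored.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem):
    """Rough KV-cache size: one key and one value tensor per layer,
    each shaped [batch_size, num_kv_heads, seq_len, head_dim]."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Assumed (hypothetical) shapes for a 34b model; check config.json to confirm.
est = kv_cache_bytes(
    num_layers=48,
    num_kv_heads=8,
    head_dim=128,
    seq_len=1000,       # max_new_tokens from the script, prompt not counted
    batch_size=20,      # num_return_sequences from the script
    bytes_per_elem=4,   # float32, the default dtype when none is requested
)
print(f"estimated KV cache: {est / GIB:.2f} GiB")
```

This counts only the cache. Weights, activations and temporary buffers used during sampling come on top of it.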
Traceback (most recent call last):
File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 505, in <module>
runMain()
File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 471, in runMain
runEval(config)
File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 482, in runEval
model = create_model(testing_model, config)
File "/remote/vg_llm/nalaka/chatsv/Evaluation/./Eval.py", line 438, in create_model
model = HuggingFace_API(model_cfg)
File "/remote/vg_llm/nalaka/chatsv/Evaluation/hf_model.py", line 10, in __init__
self.init()
File "/remote/vg_llm/nalaka/chatsv/Evaluation/hf_model.py", line 32, in init
cache_dir=self.cache_dir).to(self.device)
File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
return super().to(*args, **kwargs)
File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/remote/vg_llm/nalaka/Anaconda/envs/eval_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB. GPU 3 has a total capacity of 79.11 GiB of which 306.56 MiB is free. Including non-PyTorch memory, this process has 78.68 GiB memory in use. Of the allocated memory 78.12 GiB is allocated by PyTorch, and 120.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
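The traceback fails inside `.to(self.device)`, so memory runs out while the weights are being moved to the GPU, before generation starts. A rough estimate of the weight memory alone is sketched below. The parameter counts are nominal values taken from the model names (an assumption, the exact counts differ slightly), and 79.11 GiB is the capacity reported in the error message.

```python
GIB = 1024 ** 3
GPU_CAPACITY_GIB = 79.11  # total capacity reported in the OOM message

def weight_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / GIB

# Nominal parameter counts inferred from the model names (assumption).
models = {"13b": 13e9, "34b": 34e9, "70b": 70e9}
dtypes = {"float32": 4, "float16/bfloat16": 2}

for name, n_params in models.items():
    for dtype, nbytes in dtypes.items():
        need = weight_gib(n_params, nbytes)
        fits = "fits" if need < GPU_CAPACITY_GIB else "does not fit"
        print(f"{name} in {dtype}: {need:.1f} GiB ({fits} on one GPU)")
```

Since `from_pretrained` loads in float32 unless a `torch_dtype` is passed, the 34b weights alone would exceed a single 80 GB card under these assumptions. Passing `torch_dtype=torch.float16` (or `torch.bfloat16`) may bring 34b under the limit, but I have not confirmed that this resolves the error, and by the same estimate 70b would still not fit on one GPU in half precision.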