I’m trying to run “meta-llama/Llama-3.2-90B-Vision-Instruct” as a containerized vLLM server on OpenShift, on a single NVIDIA H100 GPU (the node has 8 GPUs available).
The pod starts, but after about 2 minutes it fails with a large error trace that includes the following:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 80.31 MiB is free. Process 2718503 has 42.07 GiB memory in use. Process 3312520 has 36.93 GiB memory in use. Of the allocated memory 33.80 GiB is allocated by PyTorch, and 55.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1121 09:26:13.404528307 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
This suggests a shortage of GPU memory, yet the GPU appears to be under-utilised and the server as a whole doesn’t seem heavily loaded.
- output from nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 27C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
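To rule out a mismatch between what the node reports and what the container actually sees, I’ve also been running a small probe inside the pod just before starting vLLM. This is only a rough sketch (the script name and placement are mine, nothing vLLM-specific); it prints free/total memory for every CUDA device visible to the container via PyTorch:

# gpu_probe.py - quick check of the GPUs the container can actually see.
# Run inside the vLLM container (same image) before launching the server.
import torch

def main() -> None:
    if not torch.cuda.is_available():
        print("CUDA is not available inside this container")
        return
    for idx in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(idx)  # values in bytes
        name = torch.cuda.get_device_name(idx)
        print(f"GPU {idx}: {name} - "
              f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

if __name__ == "__main__":
    main()

If this reports far less free memory than the nvidia-smi output above, then some other process is holding GPU 0 at the moment vLLM loads the weights, which would fit the two foreign PIDs listed in the OOM message.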
This leads me to wonder whether the problem is with the configuration of the pod, including the arguments passed to vLLM.
container:
  name: pepper-vllm-90b
  image: pepperairepo.azurecr.io/vllm/vllm-openai:0.6.3
  imagePullPolicy: IfNotPresent
  args:
    - --model
    - meta-llama/Llama-3.2-90B-Vision-Instruct
    - --gpu-memory-utilization
    - "0.6"
    - --max_model_len
    - "4096"
    - --max_num_seqs
    - "4"
    - --enforce_eager
    - --tensor-parallel-size
    - "4"
  ports:
    - containerPort: 8000
  volumeMounts:
    - name: cacheVolume
      mountPath: /root/.cache/huggingface # "/Model2" <- path from nfs where the model is stored
  # shmSize: 16Gi (shared memory available to the container)
I’ve varied these parameters following many suggestions found online, including changing --tensor-parallel-size, but nothing stops the memory-related crash.
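For anyone who wants to try reproducing this outside of OpenShift, I believe the container arguments above correspond roughly to the following offline invocation of vLLM’s Python API (a sketch only, based on the 0.6.3 API; the prompt is just a placeholder):

# repro.py - offline equivalent of the container args above (vLLM 0.6.3 image).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    tensor_parallel_size=4,       # --tensor-parallel-size 4
    gpu_memory_utilization=0.6,   # --gpu-memory-utilization 0.6
    max_model_len=4096,           # --max_model_len 4096
    max_num_seqs=4,               # --max_num_seqs 4
    enforce_eager=True,           # --enforce_eager
)

outputs = llm.generate(["Briefly describe the Llama 3.2 Vision model."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)

In my case the crash happens while the weights are still loading, so I’d expect this to fail inside the LLM(...) constructor in the same way as the server does.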
Questions:
a) Has anyone encountered this problem before?
b) Any other troubleshooting approaches for diagnosing and fixing this issue?
For completeness, here is the full context around the memory failure:
INFO 11-21 09:26:12 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=352) INFO 11-21 09:26:12 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=353) INFO 11-21 09:26:12 multiproc_worker_utils.py:242] Worker exiting
(VllmWorkerProcess pid=351) INFO 11-21 09:26:12 multiproc_worker_utils.py:242] Worker exiting
(VllmWorkerProcess pid=352) INFO 11-21 09:26:12 multiproc_worker_utils.py:242] Worker exiting
INFO 11-21 09:26:12 multiproc_worker_utils.py:121] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 335, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 111, in _init_executor
self._run_workers("load_model",
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1062, in load_model
self.model = get_model(model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
model = _initialize_model(model_config, self.load_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 175, in _initialize_model
return build_model(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
return model_class(config=hf_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 912, in __init__
self.language_model = MllamaForCausalLM(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 856, in __init__
self.model = MllamaTextModel(config, cache_config, quant_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 794, in __init__
LlamaDecoderLayer(config,
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 229, in __init__
self.mlp = LlamaMLP(
^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 71, in __init__
self.gate_up_proj = MergedColumnParallelLinear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 424, in __init__
super().__init__(input_size=input_size,
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 304, in __init__
self.quant_method.create_weights(
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 122, in create_weights
weight = Parameter(torch.empty(sum(output_partition_sizes),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 80.31 MiB is free. Process 2718503 has 42.07 GiB memory in use. Process 3312520 has 36.93 GiB memory in use. Of the allocated memory 33.80 GiB is allocated by PyTorch, and 55.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1121 09:26:13.404528307 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Some information on the physical server running this vLLM instance:
- OpenShift node (oc describe node xxxx):
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 36808m (19%) 129766m (67%)
memory 106428318273 (9%) 271788956160 (25%)
ephemeral-storage 414572800 (0%) 12Gi (1%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
devices.kubevirt.io/kvm 1 1
devices.kubevirt.io/tun 1 1
devices.kubevirt.io/vhost-net 0 0
nvidia.com/gpu 1 1
Events: <none>
- Details of the NVIDIA GPU and drivers:
nvidia.com/cuda.driver-version.full=550.127.05
nvidia.com/cuda.driver-version.major=550
nvidia.com/cuda.driver-version.minor=127
nvidia.com/cuda.driver-version.revision=05
nvidia.com/cuda.driver.major=550
nvidia.com/cuda.driver.minor=127
nvidia.com/cuda.driver.rev=05
nvidia.com/cuda.runtime-version.full=12.4
nvidia.com/cuda.runtime-version.major=12
nvidia.com/cuda.runtime-version.minor=4
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=4
nvidia.com/gfd.timestamp=1732076388
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=9
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.count=8
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.mig-manager=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.nvsm=
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=hopper
nvidia.com/gpu.machine=PowerEdge-XE9680
nvidia.com/gpu.memory=81559
nvidia.com/gpu.mode=compute
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
nvidia.com/gpu.replicas=1
nvidia.com/gpu.sharing-strategy=none
nvidia.com/gpu.workload.config=container
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-disabled
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
nvidia.com/mps.capable=false
nvidia.com/vgpu.present=false
Capacity:
cpu: 192
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 936643724Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1056285364Ki
nvidia.com/gpu: 8
pods: 250
Allocatable:
cpu: 191500m
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 862137112786
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1055134388Ki
nvidia.com/gpu: 8
pods: 250
System Info:
Note: Please comment if you’d like me to add further information!