I’m trying to run “meta-llama/Llama-3.2-90B-Vision-Instruct” as a containerized vLLM server on OpenShift, on a single NVIDIA H100 GPU (the node has 8 GPUs available).
The pod starts, but after about 2 minutes it fails with a large error trace that includes the following:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 80.31 MiB is free. Process 2718503 has 42.07 GiB memory in use. Process 3312520 has 36.93 GiB memory in use. Of the allocated memory 33.80 GiB is allocated by PyTorch, and 55.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1121 09:26:13.404528307 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
This suggests a shortage of GPU memory, yet the GPU appears to be under-utilised and the server as a whole doesn’t seem heavily loaded.
- output from nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 27C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
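To rule out a mismatch between what the node reports and what the container actually sees, I’ve also been running a small probe inside the pod just before starting vLLM. This is only a rough sketch (the script name and placement are mine, nothing vLLM-specific); it prints free/total memory for every CUDA device visible to the container via PyTorch:

# gpu_probe.py - quick check of the GPUs the container can actually see.
# Run inside the vLLM container (same image) before launching the server.
import torch

def main() -> None:
    if not torch.cuda.is_available():
        print("CUDA is not available inside this container")
        return
    for idx in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(idx)  # values in bytes
        name = torch.cuda.get_device_name(idx)
        print(f"GPU {idx}: {name} - "
              f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

if __name__ == "__main__":
    main()

If this reports far less free memory than the nvidia-smi output above, then some other process is holding GPU 0 at the moment vLLM loads the weights, which would fit the two foreign PIDs listed in the OOM message.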
This leads me to wonder whether the problem is with the configuration of the pod, including the arguments passed to vLLM.
container:
  name: pepper-vllm-90b
  image: pepperairepo.azurecr.io/vllm/vllm-openai:0.6.3
  imagePullPolicy: IfNotPresent
  args:
    - --model
    - meta-llama/Llama-3.2-90B-Vision-Instruct
    - --gpu-memory-utilization
    - "0.6"
    - --max_model_len
    - "4096"
    - --max_num_seqs
    - "4"
    - --enforce_eager
    - --tensor-parallel-size
    - "4"
  ports:
    - containerPort: 8000
  volumeMounts:
    - name: cacheVolume
      mountPath: /root/.cache/huggingface # "/Model2" <- path from nfs where the model is stored
  # shmSize: 16Gi (shared memory available to the container)
I’ve varied these parameters following many suggestions found online, including changing --tensor-parallel-size, but nothing stops the memory-related crash.
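For anyone who wants to try reproducing this outside of OpenShift, I believe the container arguments above correspond roughly to the following offline invocation of vLLM’s Python API (a sketch only, based on the 0.6.3 API; the prompt is just a placeholder):

# repro.py - offline equivalent of the container args above (vLLM 0.6.3 image).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    tensor_parallel_size=4,       # --tensor-parallel-size 4
    gpu_memory_utilization=0.6,   # --gpu-memory-utilization 0.6
    max_model_len=4096,           # --max_model_len 4096
    max_num_seqs=4,               # --max_num_seqs 4
    enforce_eager=True,           # --enforce_eager
)

outputs = llm.generate(["Briefly describe the Llama 3.2 Vision model."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)

In my case the crash happens while the weights are still loading, so I’d expect this to fail inside the LLM(...) constructor in the same way as the server does.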
Questions:
a) Has anyone encountered this problem before?
b) Any other troubleshooting approaches for diagnosing and fixing this issue?
For completeness, here is the full context around the memory failure:
INFO 11-21 09:26:12 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=352) INFO 11-21 09:26:12 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=353) INFO 11-21 09:26:12 multiproc_worker_utils.py:242] Worker exiting
(VllmWorkerProcess pid=351) INFO 11-21 09:26:12 multiproc_worker_utils.py:242] Worker exiting
(VllmWorkerProcess pid=352) INFO 11-21 09:26:12 multiproc_worker_utils.py:242] Worker exiting
INFO 11-21 09:26:12 multiproc_worker_utils.py:121] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 335, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 111, in _init_executor
self._run_workers("load_model",
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1062, in load_model
self.model = get_model(model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
model = _initialize_model(model_config, self.load_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 175, in _initialize_model
return build_model(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
return model_class(config=hf_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 912, in __init__
self.language_model = MllamaForCausalLM(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 856, in __init__
self.model = MllamaTextModel(config, cache_config, quant_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 794, in __init__
LlamaDecoderLayer(config,
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 229, in __init__
self.mlp = LlamaMLP(
^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 71, in __init__
self.gate_up_proj = MergedColumnParallelLinear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 424, in __init__
super().__init__(input_size=input_size,
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 304, in __init__
self.quant_method.create_weights(
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 122, in create_weights
weight = Parameter(torch.empty(sum(output_partition_sizes),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 80.31 MiB is free. Process 2718503 has 42.07 GiB memory in use. Process 3312520 has 36.93 GiB memory in use. Of the allocated memory 33.80 GiB is allocated by PyTorch, and 55.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1121 09:26:13.404528307 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Some information on the physical server running this vLLM instance:
- OpenShift node (oc describe node xxxx):
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 36808m (19%) 129766m (67%)
memory 106428318273 (9%) 271788956160 (25%)
ephemeral-storage 414572800 (0%) 12Gi (1%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
devices.kubevirt.io/kvm 1 1
devices.kubevirt.io/tun 1 1
devices.kubevirt.io/vhost-net 0 0
nvidia.com/gpu 1 1
Events: <none>
- Details of the NVIDIA GPU and drivers:
nvidia.com/cuda.driver-version.full=550.127.05
nvidia.com/cuda.driver-version.major=550
nvidia.com/cuda.driver-version.minor=127
nvidia.com/cuda.driver-version.revision=05
nvidia.com/cuda.driver.major=550
nvidia.com/cuda.driver.minor=127
nvidia.com/cuda.driver.rev=05
nvidia.com/cuda.runtime-version.full=12.4
nvidia.com/cuda.runtime-version.major=12
nvidia.com/cuda.runtime-version.minor=4
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=4
nvidia.com/gfd.timestamp=1732076388
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=9
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.count=8
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.mig-manager=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.nvsm=
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=hopper
nvidia.com/gpu.machine=PowerEdge-XE9680
nvidia.com/gpu.memory=81559
nvidia.com/gpu.mode=compute
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
nvidia.com/gpu.replicas=1
nvidia.com/gpu.sharing-strategy=none
nvidia.com/gpu.workload.config=container
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-disabled
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
nvidia.com/mps.capable=false
nvidia.com/vgpu.present=false
Capacity:
cpu: 192
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 936643724Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1056285364Ki
nvidia.com/gpu: 8
pods: 250
Allocatable:
cpu: 191500m
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 862137112786
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1055134388Ki
nvidia.com/gpu: 8
pods: 250
System Info:
Note: Please comment if you’d like me to add further information!