How to make Optimum use all available GPUs?

I have the following code:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
# Export the PyTorch model to ONNX and run it on GPU with the CUDA execution provider
model = ORTModelForCausalLM.from_pretrained(model_name, export=True, provider="CUDAExecutionProvider")

Getting error:

     2023-05-05 04:39:44.132458586 [W:onnxruntime:, session_state.cc:1138 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
     2023-05-05 04:39:44.653800957 [E:onnxruntime:, inference_session.cc:1532 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* 
     onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 78643200

It breaks when nvidia-smi shows:

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     33981      C   python3                                    9234MiB |
+---------------------------------------------------------------------------------------+
Fri May  5 13:24:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB            On | 00000000:04:01.0 Off |                    0 |
| N/A   35C    P0               44W / 250W|  15874MiB / 16384MiB |     45%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB            On | 00000000:04:02.0 Off |                    0 |
| N/A   32C    P0               24W / 250W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

My machine has two V100 16GB GPUs. I’m monitoring nvidia-smi, but I never see memory consumption reach 16GB on both GPUs, so I highly doubt it is using both of them.

Hi @KiranAli! You’re right: ONNX Runtime does not currently offer any kind of parallelism that would split the memory consumption across all the devices you have.
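Since the model has to fit on a single device, the most you can do on the device side is pick which GPU the session runs on. Here is a rough, untested sketch (the provider_options below are standard CUDA execution provider settings and are not something from this thread; the model still has to fit on that one GPU):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# Pin the ONNX Runtime session to a single GPU via the CUDA execution provider.
# This does NOT split memory across GPUs, it only selects which one is used.
model = ORTModelForCausalLM.from_pretrained(
    model_name,
    export=True,
    provider="CUDAExecutionProvider",
    provider_options={"device_id": 1},  # e.g. the second, currently idle V100
)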

However, I did manage to export this model on a V100 instance by doing the following:

  • Make sure the version of the onnx Python package is NOT 1.14. You can actually export the model with version 1.14, but the validation will fail. You can install a previous version with pip install "onnx<1.14".
  • Export the model using Optimum CLI:
    optimum-cli export onnx --model databricks/dolly-v2-3b output_dir
    

The memory overhead induced by the CLI export is smaller, which makes it possible to complete the export on a V100.
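Once the export has finished, the ONNX files in output_dir can be loaded back with Optimum without re-exporting. A minimal sketch (assuming the CLI run above produced a usable output_dir with the tokenizer files saved alongside the model; not part of the original replies):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Load the already-exported ONNX model from the CLI output directory
# instead of exporting again with export=True.
tokenizer = AutoTokenizer.from_pretrained("output_dir")
model = ORTModelForCausalLM.from_pretrained("output_dir", provider="CUDAExecutionProvider")

inputs = tokenizer("Explain what ONNX Runtime is.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))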

Thanks @regisss. It works on the CPU but still doesn’t work on the GPU.

Ah yes, sorry, I forgot to specify the device :frowning:

So yes, unfortunately, it fails with the following command:

optimum-cli export onnx --model databricks/dolly-v2-3b --device cuda output_dir

I tried exporting the model in fp16 to save some memory with:

optimum-cli export onnx --model databricks/dolly-v2-3b --device cuda --fp16 output_dir

It manages to export the decoder without the key/value cache; however, there is another memory allocation error when exporting the decoder with the key/value cache. I don’t know whether this is good enough for you.

We are working on enabling the export of a single decoder that handles both cases; maybe that will help.

Hi @regisss, I successfully exported dolly-v2-3b to ONNX format on an A100 40GB machine. The export took 34GB of GPU memory. With the ONNX model, inference speed improved to 1-2s, but it still takes 34GB to load the ONNX model. That is too much memory for a small model like dolly-v2-3b; earlier it was only taking 6GB. Any pointers?

I agree there is something odd here, we’ll investigate it.

Earlier it was only taking 6GB

When you say “earlier”, do you mean with older versions of Optimum?

Nope. By “earlier” I meant running without ONNX. Running the model in its original form only takes 6-7GB.

Relevant: [Performance] Find out why the GPU memory allocated with `CUDAExecutionProvider` is much larger than the ONNX size · Issue #14526 · microsoft/onnxruntime · GitHub
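The issue above points at the CUDA execution provider’s BFC memory arena, which grows aggressively by default. As a hedged sketch (these are standard ONNX Runtime CUDA execution provider options, not a fix confirmed in this thread), you can try capping the arena and making it grow only by what is actually requested:

from optimum.onnxruntime import ORTModelForCausalLM

# CUDA execution provider options that constrain the BFC arena:
# - gpu_mem_limit caps the arena size (in bytes)
# - arena_extend_strategy "kSameAsRequested" grows the arena only by the requested amount
provider_options = {
    "gpu_mem_limit": 12 * 1024 * 1024 * 1024,  # 12 GiB, adjust to your GPU
    "arena_extend_strategy": "kSameAsRequested",
}

model = ORTModelForCausalLM.from_pretrained(
    "output_dir",  # previously exported ONNX model
    provider="CUDAExecutionProvider",
    provider_options=provider_options,
)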