How to make Optimum use all available GPUs?

I have the following code:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
# Export the PyTorch model to ONNX and run it on GPU with the CUDA execution provider
model = ORTModelForCausalLM.from_pretrained(model_name, export=True, provider="CUDAExecutionProvider")

Getting error:

     2023-05-05 04:39:44.132458586 [W:onnxruntime:, session_state.cc:1138 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
     2023-05-05 04:39:44.653800957 [E:onnxruntime:, inference_session.cc:1532 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* 
     onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 78643200

It breaks when nvidia-smi shows:

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     33981      C   python3                                    9234MiB |
+---------------------------------------------------------------------------------------+
Fri May  5 13:24:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB            On | 00000000:04:01.0 Off |                    0 |
| N/A   35C    P0               44W / 250W|  15874MiB / 16384MiB |     45%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB            On | 00000000:04:02.0 Off |                    0 |
| N/A   32C    P0               24W / 250W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

My machine has two V100 16GB GPUs. I’m monitoring nvidia-smi, but I never see memory consumption reach 16GB on both GPUs, so I highly doubt it is using both of them.

Hi @KiranAli! You’re right: ONNX Runtime does not currently offer any kind of parallelism that would split the memory consumption across all the devices you have.
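Since the model has to fit on a single device, the most you can do on the device side is pick which GPU the session runs on. Here is a rough, untested sketch (the provider_options below are standard CUDA execution provider settings and are not something from this thread; the model still has to fit on that one GPU):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# Pin the ONNX Runtime session to a single GPU via the CUDA execution provider.
# This does NOT split memory across GPUs, it only selects which one is used.
model = ORTModelForCausalLM.from_pretrained(
    model_name,
    export=True,
    provider="CUDAExecutionProvider",
    provider_options={"device_id": 1},  # e.g. the second, currently idle V100
)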

However, I did manage to export this model on a V100 instance by doing the following:

  • Make sure the version of the onnx Python package is NOT 1.14. You can actually export the model with version 1.14, but the validation will fail. You can install a previous version with pip install "onnx<1.14".
  • Export the model using Optimum CLI:
    optimum-cli export onnx --model databricks/dolly-v2-3b output_dir
    

The memory overhead induced by the CLI export is smaller, which makes it possible to complete the export on a V100.
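Once the export has finished, the ONNX files in output_dir can be loaded back with Optimum without re-exporting. A minimal sketch (assuming the CLI run above produced a usable output_dir with the tokenizer files saved alongside the model; not part of the original replies):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Load the already-exported ONNX model from the CLI output directory
# instead of exporting again with export=True.
tokenizer = AutoTokenizer.from_pretrained("output_dir")
model = ORTModelForCausalLM.from_pretrained("output_dir", provider="CUDAExecutionProvider")

inputs = tokenizer("Explain what ONNX Runtime is.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))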

Thanks @regisss. It works on the CPU but still doesn’t work on the GPU.

Ah yes, sorry, I forgot to specify the device :frowning:

So yes, unfortunately, it fails with the following command:

optimum-cli export onnx --model databricks/dolly-v2-3b --device cuda output_dir

I tried exporting the model in fp16 to save some memory with:

optimum-cli export onnx --model databricks/dolly-v2-3b --device cuda --fp16 output_dir

It manages to export the decoder without the key/value cache; however, there is another memory allocation error when exporting the decoder with the key/value cache. I don’t know whether this is good enough for you.

We are working on enabling the export of a single decoder that handles both cases; maybe that will help.

Hi @regisss, I successfully exported dolly-v2-3b to ONNX format on an A100 40GB machine. The export took 34GB of GPU memory. With the ONNX model, inference speed improved to 1-2s, but it still takes 34GB to load the ONNX model. That is too much memory for a small model like dolly-v2-3b; earlier it was only taking 6GB. Any pointers?

I agree there is something odd here, we’ll investigate it.

Earlier it was only taking 6GB

When you say “earlier”, do you mean with older versions of Optimum?

Nope. By “earlier” I meant running without ONNX. Running the model in its original form only takes 6-7GB.

Relevant: [Performance] Find out why the GPU memory allocated with `CUDAExecutionProvider` is much larger than the ONNX size · Issue #14526 · microsoft/onnxruntime · GitHub
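The issue above points at the CUDA execution provider’s BFC memory arena, which grows aggressively by default. As a hedged sketch (these are standard ONNX Runtime CUDA execution provider options, not a fix confirmed in this thread), you can try capping the arena and making it grow only by what is actually requested:

from optimum.onnxruntime import ORTModelForCausalLM

# CUDA execution provider options that constrain the BFC arena:
# - gpu_mem_limit caps the arena size (in bytes)
# - arena_extend_strategy "kSameAsRequested" grows the arena only by the requested amount
provider_options = {
    "gpu_mem_limit": 12 * 1024 * 1024 * 1024,  # 12 GiB, adjust to your GPU
    "arena_extend_strategy": "kSameAsRequested",
}

model = ORTModelForCausalLM.from_pretrained(
    "output_dir",  # previously exported ONNX model
    provider="CUDAExecutionProvider",
    provider_options=provider_options,
)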