I hit an out-of-memory (OOM) error when exporting a large model to ONNX. I wonder how Optimum handles this issue.
Here is my setup:
- The code works for a smaller model with fewer parameters, so the error is due to the model size.
- I can't export the model on the CPU because it is in fp16.
- The model itself takes 20GB of CUDA memory, and the total GPU capacity is 80GB. (Exporting a smaller model suggests roughly 10x the model's memory is consumed during export: a 2GB model needed about 20GB.)
- Running multiple forward passes before export doesn't cause any trouble.
- A greedy search is implemented inside the graph to generate 32 tokens, so a lot of intermediate `past_key_values` are cached.