I’m on the organization Lab plan and trying to use GPU-Accelerated Inference with the facebook/bart-large-mnli model. I’m using it for zero-shot text classification and passing 10 candidate labels. With GPU-Accelerated Inference enabled I get a 400 Bad Request error. The same request works without GPU, but the latency is not acceptable.
This is the error message -
"error": "CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at****some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1."
Any leads on this would be really helpful.