I’m on the organization Lab plan and trying to use GPU-Accelerated Inference. I am querying our private question-answering models, sending a dataset of 300 question/context pairs one by one to the HF Inference API. With GPU-Accelerated Inference enabled, after sending a few samples and receiving their answers, I start getting the following errors:
“There was an inference error: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.”

“There was an inference error: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`”
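For reference, this is roughly how I send each request (a minimal sketch; the model ID, token, and sample data are placeholders, and `use_gpu` is my understanding of how GPU acceleration is requested on the paid plans):

```python
import requests

# Placeholders: substitute our private model ID and the organization API token.
API_URL = "https://api-inference.huggingface.co/models/<org>/<private-qa-model>"
HEADERS = {"Authorization": "Bearer <api_token>"}

def query(question: str, context: str) -> dict:
    """Send one question/context pair to the hosted question-answering pipeline."""
    payload = {
        "inputs": {"question": question, "context": context},
        # use_gpu requests GPU-Accelerated Inference on the paid plans (as I
        # understand it); wait_for_model avoids 503s while the model loads.
        "options": {"use_gpu": True, "wait_for_model": True},
    }
    response = requests.post(API_URL, headers=HEADERS, json=payload)
    return response.json()

# In the real run, `samples` holds all 300 question/context pairs.
samples = [
    {"question": "Where is the Eiffel Tower?",
     "context": "The Eiffel Tower is in Paris."},
]

for sample in samples:
    print(query(sample["question"], sample["context"]))
```

The first few calls succeed and return answers; the errors above then start appearing in the responses.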
Any leads on this would be really helpful.