I’m on the organization Lab plan and trying to use GPU-Accelerated Inference. I am querying our private question-answering models, sending a dataset of 300 question/context pairs one by one to the HF Inference API. With GPU-Accelerated Inference enabled, after sending a few samples and receiving their answers, I start getting the following errors:
“There was an inference error: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.”

“There was an inference error: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`”
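For reference, this is roughly how I send each request (a minimal sketch; the model ID, token, and sample data are placeholders, and `use_gpu` is my understanding of how GPU acceleration is requested on the paid plans):

```python
import requests

# Placeholders: substitute our private model ID and the organization API token.
API_URL = "https://api-inference.huggingface.co/models/<org>/<private-qa-model>"
HEADERS = {"Authorization": "Bearer <api_token>"}

def query(question: str, context: str) -> dict:
    """Send one question/context pair to the hosted question-answering pipeline."""
    payload = {
        "inputs": {"question": question, "context": context},
        # use_gpu requests GPU-Accelerated Inference on the paid plans (as I
        # understand it); wait_for_model avoids 503s while the model loads.
        "options": {"use_gpu": True, "wait_for_model": True},
    }
    response = requests.post(API_URL, headers=HEADERS, json=payload)
    return response.json()

# In the real run, `samples` holds all 300 question/context pairs.
samples = [
    {"question": "Where is the Eiffel Tower?",
     "context": "The Eiffel Tower is in Paris."},
]

for sample in samples:
    print(query(sample["question"], sample["context"]))
```

The first few calls succeed and return answers; the errors above then start appearing in the responses.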
Any leads on this would be really helpful.