Donut inference in production

WaterKnight · November 11, 2022, 3:41pm

Hi,

I have a container with an API, implemented with FastAPI, that serves predictions from the model.

If I send requests sequentially, everything works well. When I send them in parallel with joblib, I start getting CUDA errors.

To try to solve this, I disabled workers in uvicorn and added a CUDA stream, but I still get the issue.
