Donut inference in production


I have a containerized API, implemented with FastAPI, that serves predictions from the model.

If I send requests sequentially, everything works well, but as soon as I send them in parallel with joblib I start getting CUDA errors.
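The client side is roughly the following (a sketch: the request function is stubbed so it runs standalone, and in the real client it would POST an image to the `/predict` endpoint):

```python
# Sketch of the parallel client. send_request is a stand-in for the real
# HTTP call, e.g. requests.post("http://host:8000/predict", files=...).json()
from joblib import Parallel, delayed


def send_request(i: int) -> dict:
    # Stubbed so the sketch is self-contained; the real version hits the API.
    return {"id": i, "prediction": "<stub>"}


def run_parallel(n: int = 8) -> list:
    # backend="threading" keeps the stub simple; joblib's default loky backend
    # spawns processes instead, but either way the server ends up handling
    # several concurrent requests against one GPU-resident model.
    return Parallel(n_jobs=4, backend="threading")(
        delayed(send_request)(i) for i in range(n)
    )
```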

To try to solve this, I limited uvicorn to a single worker and added CUDA streams to try to get it working, but I still get the errors.