Docker space rebuilds with "NotFound" error

I have a private Space with the Docker runtime. Its main purpose is to split a file into sentences and then compute an embedding for each sentence using the intfloat/multilingual-e5-large model. The input is usually a 7-10 MB PDF file containing around 22k sentences.

After the model is loaded and all the sentences are split, the container starts rebuilding during the embedding computation, and I get this error in the logs:

error: code = NotFound desc = an error occurred when try to find container "ebc9b54a2889b5b7d21cc4462dfaf493c731c625da1f6b061e44f2d26806197b": not found

I tried both T4 small and T4 medium hardware, and the error happens on both. When I run the same container locally on my laptop, no errors occur. For comparison, T4 medium has 8 vCPU, 30GB RAM and 16GB VRAM, while my laptop has 8 cores, 16GB RAM and only 6GB VRAM. The computation uses at most 4GB of RAM and 3.7GB of VRAM, so I’m quite sure it’s not caused by resource limits.
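For anyone who wants to check the RAM figure from inside the running container rather than from a local run, here is a minimal sketch using only the standard library. The helper name `peak_rss_mb` is made up for illustration, it only covers host RAM (not VRAM), and it assumes a Unix-like system where the `resource` module is available.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return the peak resident set size of this process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Print the value so it shows up in the Space's container logs.
print(f"peak RSS so far: {peak_rss_mb():.1f} MB", flush=True)
```

Calling this after each stage (model load, sentence splitting, each embedding batch) would show how close the process gets to the hardware limit before the container stops.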

The Docker image is built on top of the nvidia/cuda:11.7.1 image. What could cause the error?

Also, per the documentation, the volume limit of a Space is 50GB (ephemeral, but still). My image is ~8GB and the model is another 2.4GB, so the total surely can’t go over 50GB.
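The disk estimate can also be checked from inside the container. This is only a sketch: `disk_usage_gb` is a hypothetical helper, and the path `/` is an assumption about where the ephemeral storage is mounted.

```python
import shutil


def disk_usage_gb(path: str = "/") -> dict:
    """Return total, used and free space for the filesystem at `path`, in GB."""
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    return {
        "total": usage.total / gb,
        "used": usage.used / gb,
        "free": usage.free / gb,
    }


# Print the numbers so they appear in the container logs.
print({name: round(value, 1) for name, value in disk_usage_gb("/").items()}, flush=True)
```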

Hi @andrewyazura, apologies for the inconvenience. Could you please confirm whether you’re using persistent storage on your Space?

Hi, @radames! I’m using ephemeral storage, but all the embeddings are stored in a remote PostgreSQL database outside of the Space. Only the app and the e5 model are on the ephemeral storage.

Considering that you’ve tested locally for potential OOM issues, and that this error message is not very clear, it could be something in our internal infra. One last test: could you duplicate your Space and try running it again from a clean state? Otherwise we might need help from the infra team.

Yes, I’ve duplicated the Space and got the same result, but this time there are a few more logs.

There is an error at the beginning of the application logs, but it doesn’t affect the logic afterwards. It’s a problem with my Streamlit code that I haven’t fixed yet.

Also, right after the Preprocessing progress bar there is a Stopping... message that I’ve never seen locally, and I’m not sure why it appears before the embedding computation.

The repo and the Space are private, but if it helps, I can make them public.

Thanks for the logs. I’ll loop in @chris-rannou and @XciD; they can look internally for more context.

Thank you for the assistance, I’ll be waiting.

Hi @andrewyazura,

Do you see an error message at the top of your Space page after the startup fails?

How long does the startup take? Spaces are limited to a 30-minute startup; if a Space takes longer than that, it is considered failed. A Space counts as started once it can handle HTTP requests.

Could you share the Space name?

Hi @chris-rannou,

  1. No, there is no error on the page itself. The page becomes blank and the indicator at the top changes from Running on ... to Building....
  2. Building takes at most 3-4 minutes, once it’s built it becomes usable instantly.
  3. andrewyazura/psychology-qa-streamlit, it is currently private, do you need me to make it public?

Are you interacting with the Space repository at runtime? Committing, or updating secrets or variables?
It seems the Space has been Running for the last 36 hours.

Indeed, looking at the resource usage, the Space does not seem to exceed its allocated resource limits.

Did you attempt a factory reboot of the Space?

Yes, right now I’m committing to the Space, but at the time the error happened, all I did was start the text processing and wait for it to complete, without making any changes to variables, secrets or the repo. By text processing I mean splitting the text into sentences and then computing embeddings for batches of sentences.

I did attempt a factory reboot, but only once, and it didn’t help. Creating a duplicate Space did not help either.

I’ve just noticed that running intfloat/multilingual-e5-large on a single sentence (a query for semantic search) works fine, but when I run it in batches of 32 over a large file, it stops and the container starts rebuilding. I’ll try reducing the batch size and maybe that will fix it for me, but it is still strange, as it works fine on my laptop.
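A rough sketch of what the smaller-batch experiment could look like. This is not the actual code from the Space: `embed_in_batches` and `iter_batches` are hypothetical helpers, and `embed_fn` stands in for whatever call really runs the model on a list of sentences. The batch size of 8 is just an arbitrary value below the current 32.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    sentences: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 8,
) -> List[List[float]]:
    """Run `embed_fn` on small batches and collect the vectors in input order."""
    embeddings: List[List[float]] = []
    for index, batch in enumerate(iter_batches(sentences, batch_size)):
        embeddings.extend(embed_fn(batch))
        # Log progress so the last completed batch is visible if the container stops.
        print(f"batch {index} done, {len(embeddings)}/{len(sentences)} sentences", flush=True)
    return embeddings


# Usage with a stand-in embedding function (the real one would call the model).
fake_embed = lambda batch: [[float(len(sentence))] for sentence in batch]
vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

Logging after every batch also helps narrow down whether the restart always happens at the same point in the file or after a certain amount of work.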