Spaces stuck in "Building": infra-side issue, and how to troubleshoot further?

Hello. I have 2 private Spaces (one on my account, the other in an org). Both are set on sleep timers, and they have spun up OK in the past after going to sleep. Since this AM, both are stuck in “Building” status. The Build logs show the image pushed and the cache exported successfully. The Container Logs show startup (below), but the app is not starting (I have some logging that typically prints when it is working). I have tried Restart Space and Factory Reboot. I also tried changing the Space Hardware to see if it was hardware resource contention. Still “Building …”

(1) I see a few comments referring to a build queue. Is this indicated somewhere in the logs, or is there something else to check? Sorry if I missed where it was indicated. (EDIT: I see “Build Queued” at the top of the Build Logs, so I might just need to wait for resource availability.)

(2) The Hugging Face service status seems to indicate things are OK. Is something going on on the infra side?

Reading some of the threads, it sounds like waiting until something is updated on the HF side is the typical resolution. Is there anything else I can do to get it working again?

Thanks!

e.g.:

===== Application Startup at 2023-09-08 19:17:32 =====

Not sure if it helps, but I have received a different message when I try to do the factory reboot. Still getting through the build queue successfully and the

Container Logs (after long waiting period):

Error: Failed to load logs: Not Found. Logs are persisted for 30 days after the Space stops running.

Build Logs tail:

→ COPY --link --chown=1000 --from=lfs /app /home/user/app
DONE 0.0s

→ COPY --link --chown=1000 ./ /home/user/app
DONE 0.0s

→ Pushing image
DONE 23.4s

→ Exporting cache
DONE 6.9s

Sorry, could you please try duplicating your Space and see if the new Space builds and runs successfully?
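For anyone who would rather script the duplication than use the web UI, here is a minimal sketch using the `huggingface_hub` client. It assumes the package is installed and that a token with write access is already configured. `copy_id` and `duplicate_private` are hypothetical helpers for illustration, not part of the library, and the "-copy" suffix is an arbitrary choice.

```python
def copy_id(from_id: str, suffix: str = "-copy") -> str:
    """Hypothetical helper: derive a target repo id such as 'user/app-copy'."""
    namespace, _, name = from_id.partition("/")
    if not name:
        raise ValueError("expected an id of the form 'namespace/name'")
    return f"{namespace}/{name}{suffix}"


def duplicate_private(from_id: str) -> str:
    """Duplicate a Space into a private copy and return the new Space's URL."""
    # Imported lazily so copy_id can be used without huggingface_hub installed.
    from huggingface_hub import duplicate_space

    url = duplicate_space(from_id, to_id=copy_id(from_id), private=True)
    return str(url)
```

Calling `duplicate_private("namespace/name")` with the id of the stuck Space would then create the copy in the same namespace.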

Thanks @radames. I just cloned it; the clone downloaded all of the model info and started up!

I am guessing this indicates that something in HF_HOME (configured as /data/.huggingface in the Space) got corrupted or similar. Is there a way of deleting that folder (removing it from the variables, running, and adding it back)? Or should I just blow away the old Spaces and go with the clones?

(EDIT2: I see an option to “Remove current storage” in Settings, but I get this error when I try to add storage back; it might just take time to reset:

“Error while upgrading the persistent storage: An error happened while upgrading your Space’s storage. Status code: 409”)
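One way to clear the cache without removing the storage itself might be to delete the folder from inside the Space at startup. The sketch below assumes HF_HOME points at the cache on persistent storage, as configured above. RESET_HF_CACHE is a hypothetical one-time switch made up for this example, and `reset_hf_cache` is an illustrative helper; neither is a Spaces feature.

```python
import os
import shutil
from pathlib import Path


def reset_hf_cache(env=None) -> bool:
    """Delete the HF_HOME directory when RESET_HF_CACHE=1 (hypothetical flag).

    Returns True only if a directory was actually removed.
    """
    env = os.environ if env is None else env
    if env.get("RESET_HF_CACHE") != "1":
        return False  # switch is off: leave the cache alone
    hf_home = env.get("HF_HOME")
    if not hf_home:
        return False  # nothing configured, nothing to delete
    path = Path(hf_home)
    if not path.is_dir():
        return False  # folder is already gone
    shutil.rmtree(path)
    return True
```

The idea would be to call `reset_hf_cache()` at the very top of app.py, before any library reads the cache, set the variable once, restart the Space, and then remove the variable so later restarts keep the freshly downloaded files.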

Thanks @ecarr-compoze for the feedback, I think it’s the same issue related to this

We’ll investigate with infra next week cc @chris-rannou

@radames Hello, I have the same problem with this Space: MetaGPT - a Hugging Face Space by deepwisdom. Can you help me solve it?

I tried cloning Stable Cascade - a Hugging Face Space by multimodalart. The build process stopped with the error below. It looks like a DNS error internal to HF’s hosting. I can’t find a way to retry the build. Advice?

  File "/home/user/.local/lib/python3.10/site-packages/gradio/components/button.py", line 65, in __init__
    self.icon = self.move_resource_to_block_cache(icon)
  File "/home/user/.local/lib/python3.10/site-packages/gradio/blocks.py", line 249, in move_resource_to_block_cache
    temp_file_path = processing_utils.save_url_to_cache(
  File "/home/user/.local/lib/python3.10/site-packages/gradio/processing_utils.py", line 193, in save_url_to_cache
    with httpx.stream("GET", url, follow_redirects=True) as r, open(
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_api.py", line 160, in stream
    with client.stream(
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_client.py", line 870, in stream
    response = self.send(
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_client.py", line 1015, in _send_single_request
    response = transport.handle_request(request)
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 232, in handle_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/user/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno -3] Temporary failure in name resolution
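A quick way to check whether name resolution is really failing inside the container, as the last line of the traceback suggests, is to try resolving a host directly. This is a minimal sketch using only the Python standard library; `can_resolve` and `host_of` are illustrative names, and the snippet is meant to be run temporarily near the start of app.py.

```python
import socket
from urllib.parse import urlparse


def host_of(url: str):
    """Extract the hostname from a URL, e.g. the icon URL Gradio tries to fetch."""
    return urlparse(url).hostname


def can_resolve(host: str) -> bool:
    """Return True if the system resolver can turn `host` into an address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) covers errors such as
        # "[Errno -3] Temporary failure in name resolution".
        return False
```

Printing `can_resolve(host_of(url))` for the URL that fails would show whether the container has working DNS at startup or whether the failure is specific to one host.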

I was using Docker + Gradio. In the Dockerfile I placed this:

# Set Gradio server name to bind to 0.0.0.0 for external access
ENV GRADIO_SERVER_NAME="0.0.0.0"

and it worked. If you don't use Docker, I guess in your app.py you could write

iface.launch(server_name="0.0.0.0")
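Expanding on the app.py suggestion above, a minimal sketch might look like this. It assumes Gradio is installed and that `iface` is a `gr.Interface`; the `greet` function and the `server_name_from_env` helper are placeholders made up for illustration. The helper falls back to 0.0.0.0 when GRADIO_SERVER_NAME is not set, so the same file works with or without the Dockerfile line.

```python
import os


def server_name_from_env(env=None) -> str:
    """Return the host to bind to, defaulting to all interfaces (0.0.0.0)."""
    env = os.environ if env is None else env
    return env.get("GRADIO_SERVER_NAME") or "0.0.0.0"


def greet(name: str) -> str:
    """Placeholder app logic."""
    return f"Hello, {name}!"


def main() -> None:
    # Imported here so the helpers above work without Gradio installed.
    import gradio as gr

    iface = gr.Interface(fn=greet, inputs="text", outputs="text")
    # Bind to 0.0.0.0 so the server is reachable from outside the container.
    iface.launch(server_name=server_name_from_env())
```

Calling `main()` at the bottom of app.py would start the server.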
