503 Service Unavailable hitting multiple major Image-to-3D Spaces (TripoSR, InstantMesh, LGM) via Gradio Client

Hello,

I am trying to use the gradio_client Python API to generate 3D models, but I am receiving persistent 503 Service Unavailable errors across multiple prominent, unrelated Image-to-3D Spaces.

I ran a test script against the following Spaces, and all of them are returning the exact same 503 Server Error:

  • stabilityai/TripoSR

  • TencentARC/InstantMesh

  • ashawkey/LGM

  • Zhengyi/CRM

Here is a snippet of the traceback I am receiving:

httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'https://huggingface.co/api/spaces/stabilityai/TripoSR'
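
For reference, my test loop was essentially the following (a minimal sketch; the 503 surfaces when the Client is constructed, since it resolves each Space through the HF API before any prediction call is made):

from gradio_client import Client

# Each Space fails identically: Client() first resolves the Space via
# https://huggingface.co/api/spaces/<id>, and that request returns the 503.
for space in ["stabilityai/TripoSR", "TencentARC/InstantMesh",
              "ashawkey/LGM", "Zhengyi/CRM"]:
    try:
        Client(space)
        print(space, "reachable")
    except Exception as e:
        print(space, "failed:", e)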

Because this is affecting different accounts and Spaces simultaneously, it seems to be a broader infrastructure issue with the Gradio API endpoints or Spaces serving. Is there currently a known outage affecting the API accessibility of Spaces, and is there an ETA for when it might be resolved?

Thank you!

1 Like

All of those Spaces are crashing with runtime errors… so naturally, they won't work via the API either.

While the method for fixing the code varies depending on the Space, only the author (or someone else with the necessary permissions) can apply the fix…

1 Like

Thanks for the insight on the runtime errors. You’re right that the Spaces are currently in a crash state, but I suspect this is a broader infrastructure issue rather than individual code errors by the authors for a few reasons:

  1. Statistical Anomaly: It is highly unlikely that four separate research teams (Stability AI, Tencent, etc.) all broke their unrelated codebases at the exact same hour, yesterday and today.

  2. External Failures: I’ve confirmed that Microsoft Copilot Labs’ 3D generation is also currently failing. Since Copilot’s 3D features and TripoSR share the same architectural DNA (Tripo AI), this points to an upstream API or model weight server failure.

  3. The ‘3D 2.0’ Migration: Tripo AI officially launched their ‘AI 3D 2.0’ (Smart Mesh P1.0) on April 1st. It’s very possible that a major deprecation of the v1 endpoints or a change in how model weights are served has orphaned these older Spaces.

  4. Environment Drift: The logs across these Spaces are showing common onnxruntime and libcudart mismatches. This usually happens when Hugging Face updates its base ZeroGPU or Inference Endpoint environment, breaking any Space that doesn’t have strictly pinned dependencies.

While only the authors can ‘fix’ the code to match the new environment, the fact that it’s a multi-platform blackout suggests the platform/upstream dependency changed under them.

1 Like

I’ve reviewed the Spaces code history both manually and using AI, and it seems that for quite some time now, the code has been written in a way that no longer works when rebuilt in the “current” HF Spaces environment.
This is partly due to changes on the HF Spaces runtime side, and as you mentioned, external dependencies—such as the libraries used by Spaces—also play a significant role.

Therefore, I believe modifying the code itself is necessary regardless of the cause of the sudden reboots.

Aside from your hypothesis, here is what I can think of as a cause for the sudden forced restarts of multiple Spaces: [Bug] Space stuck in 503 loop after waking from pause — cannot restart.
From the serious HF Spaces-related issue two weeks ago (Stuck at space problem - #3 by hysts) up to now, various Spaces-related problems have been occurring one after another, and this could be considered part of that pattern. 😓

1 Like

Based on the issues I identified while modifying my own Spaces, the following are the minimum necessary concrete fixes:


All four currently show Runtime error, but they do not fail for one shared code reason. What happened is closer to this: a platform-side event likely forced cold starts, and those cold starts exposed four different latent startup problems. So the right repair strategy is not “apply one global workaround,” but “make each repo boot cleanly in the current Spaces environment with the smallest justified diff.”

The four buckets are: missing onnxruntime after the rembg import for TripoSR and InstantMesh, a dead upstream repo id for CRM, and a native CUDA runtime mismatch for LGM. HF’s current Spaces config still supports explicit python_version pinning, and current ZeroGPU docs list 3.10.13 and 3.12.12 as supported Python versions, so version pinning is still part of the stabilization story. (Hugging Face)

The key principle is this: even if the trigger was a restart, unpause, rebuild, scheduler issue, or temporary API problem, these fixes are still needed because each current repo has a deterministic startup failure in its own code path. In other words, even if the platform caused the failure to become visible, the repo still has to be made bootable. That is why I would keep the patches minimal and specific instead of doing large framework upgrades first. pip has become stricter over time, but none of the four currently exposed failures are primarily “requirements syntax” bugs. They are startup dependency and binary/runtime mismatches. (pip)

What I would change first, in order

  1. TripoSR: add onnxruntime.
  2. InstantMesh: add onnxruntime, and pin python_version: 3.10.13 in README.
  3. CRM: replace the dead stabilityai/stable-diffusion-2-1-base scheduler source with sd2-community/stable-diffusion-2-1-base.
  4. LGM: add nvidia-cuda-runtime-cu11, then preload libcudart.so.11.0 before importing the compiled extension. This is the smallest targeted fix for the current public crash, but LGM is the only one where I would keep an explicit fallback plan in mind if the first patch is not enough. (Hugging Face)

1) TripoSR

Why it crashes now

The current app imports rembg at module import time, and the current requirements.txt includes bare rembg but not onnxruntime. The public runtime traceback for this Space shows exactly that failure path: import rembg → import onnxruntime as ort → ModuleNotFoundError: No module named 'onnxruntime'. The README already pins python_version: 3.10.13, so Python drift is not the first thing to fix here. (Hugging Face)
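
The failure is reproducible in isolation; a minimal sketch of the import chain (no TripoSR code needed):

# Bare `import rembg` transitively executes `import onnxruntime as ort`,
# so a cold start dies before Gradio ever binds a port.
try:
    import rembg  # noqa: F401
except ModuleNotFoundError as e:
    print("startup would crash here:", e)  # No module named 'onnxruntime'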

Smallest patch

requirements.txt

 omegaconf==2.3.0
 Pillow==10.1.0
 einops==0.7.0
 transformers==4.35.0
 trimesh==4.0.5
 rembg
+onnxruntime
 huggingface-hub
 gradio

Why this patch is needed even if the trigger was external

Because the failure is deterministic at startup. The current repo asks Python to import rembg before the app can even finish importing, and the public crash shows that the installed environment does not contain onnxruntime. A platform restart may have exposed it, but a clean cold start will keep hitting the same line until onnxruntime is present. This is why I would not start by upgrading Gradio or Torch here. The smallest repair is to add the missing package that the current code path actually imports. (Hugging Face)

Why I am not making a bigger first patch

You could switch to a newer rembg release that declares onnxruntime through extras (e.g. rembg[cpu]), but that is not the smallest safe move for this repo. The exposed failure is not “wrong Gradio API,” not “wrong Torch version,” and not “wrong Python version.” It is specifically “onnxruntime is missing.” So the one-line fix above is the cleanest first pass. (Hugging Face)


2) InstantMesh

Why it crashes now

This Space has the same primary failure as TripoSR. app.py imports rembg, and the preprocessing path creates a rembg session. requirements.txt still lists bare rembg, and the public runtime traceback again shows ModuleNotFoundError: No module named 'onnxruntime'. Unlike TripoSR, its README metadata does not currently specify python_version, even though HF supports pinning it in README YAML. (Hugging Face)

Smallest patch

README.md

 title: InstantMesh
 emoji:
 colorFrom: indigo
 colorTo: green
 sdk: gradio
 sdk_version: 4.26.0
+python_version: 3.10.13
 app_file: app.py
 pinned: false
 short_description: Create a 3D model from an image in 10 seconds!
 license: apache-2.0

requirements.txt

 torch==2.1.0
 torchvision==0.16.0
 torchaudio==2.1.0
 pytorch-lightning==2.1.2
 einops
 omegaconf
 deepspeed
 torchmetrics
 webdataset
 accelerate
 tensorboard
 PyMCubes
 trimesh
 rembg
+onnxruntime
 transformers==4.34.1
 diffusers==0.19.3
 bitsandbytes
 imageio[ffmpeg]
 xatlas
 plyfile
 xformers==0.0.22.post7
 git+https://github.com/NVlabs/nvdiffrast/
 huggingface-hub

Why this patch is needed even if the trigger was external

Again, because the current startup path already contains the failure. The app imports rembg before the UI is ready, and the publicly reported runtime failure is the missing onnxruntime import. The Python pin is a separate hardening step: HF lets Spaces pin python_version, and current ZeroGPU docs explicitly list 3.10.13 as supported. Even if the platform restart is what made the breakage visible, keeping Python fixed removes one more moving part from future cold starts. (Hugging Face)
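
Once onnxruntime is added, a quick smoke test for the preprocessing path (a sketch; new_session is rembg's actual session factory, and "u2net" is its default model):

from PIL import Image
from rembg import new_session, remove

# Confirms the rembg -> onnxruntime chain imports and a session builds on CPU.
session = new_session("u2net")  # downloads the model weights on first run
out = remove(Image.new("RGB", (64, 64), "white"), session=session)
print(out.mode, out.size)  # RGBA (64, 64)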

What I would not do first

I would not begin by mass-upgrading the whole dependency stack. There is already a community PR that proposes a larger cleanup including numpy<2.0.0, Pillow==10.4.0, newer gradio, and simplified requirements. That may be useful later, but the smallest justified first repair is still “add onnxruntime and pin Python.” (Hugging Face)

If the first patch boots but then fails later

The next smallest hardening step is:

+numpy<2.0.0
+Pillow==10.4.0

I would only do that after confirming that the startup blocker moved past rembg/onnxruntime. The reason is simple: fix the deterministic boot failure first, then deal with second-order runtime drift. (Hugging Face)


3) CRM

Why it crashes now

The current public runtime error is very specific. The Space tries to build a DDIMScheduler from stabilityai/stable-diffusion-2-1-base, and that repo id no longer resolves publicly for the needed scheduler config. The crash trace points into model/crm/model.py at the scheduler initialization. At the same time, app.py defaults --device to "cuda" and moves the model there during startup, which is an additional fragility point once the scheduler problem is fixed. (Hugging Face)

Smallest patch

model/crm/model.py

-self.scheduler = DDIMScheduler.from_pretrained(
-    "stabilityai/stable-diffusion-2-1-base",
-    subfolder="scheduler",
-)
+self.scheduler = DDIMScheduler.from_pretrained(
+    "sd2-community/stable-diffusion-2-1-base",
+    subfolder="scheduler",
+)

Why this patch is needed even if the trigger was external

Because the current repo points at a model id that no longer works for this code path, and the public crash trace shows exactly that path failing. sd2-community/stable-diffusion-2-1-base exists, and its repo contains scheduler/scheduler_config.json, which is the file CRM is trying to load. So this is not a speculative change. It is a direct one-line replacement for the dead dependency that the current startup path is trying to read. Even if a platform restart is what surfaced the error, any future cold start will keep failing until the repo id is replaced. (Hugging Face)
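
You can verify that the replacement id resolves before touching the Space; a standalone sketch using the exact call CRM makes:

from diffusers import DDIMScheduler

# The mirror must expose scheduler/scheduler_config.json for this to succeed.
scheduler = DDIMScheduler.from_pretrained(
    "sd2-community/stable-diffusion-2-1-base",
    subfolder="scheduler",
)
print(type(scheduler).__name__, scheduler.config.num_train_timesteps)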

Optional but very cheap second line

app.py

-parser.add_argument("--device", type=str, default="cuda")
+parser.add_argument("--device", type=str, default="cuda" if torch.cuda.is_available() else "cpu")

Why I would add that second line

The scheduler fix is the primary repair. But once the app gets past that point, startup still does model = model.to(args.device) and passes device=args.device into the pipeline constructor. Right now that default is hard-coded to "cuda". So if the Space is restarted on a CPU-backed environment, or on a GPU path that is temporarily unavailable, the next boot can fail later in startup. That one-line default makes the app more robust without changing its interface or behavior when CUDA is actually available. (Hugging Face)

What I would not do first

I would not start by adding tokens or auth logic. The current public problem is not “this repo is gated but otherwise correct.” The practical issue is that the code points at a repo id that no longer works for the scheduler path, and a community mirror already exposes the file CRM needs. So the smallest valid fix is to swap the source, not to add authentication plumbing. (Hugging Face)


4) LGM

Why it crashes now

LGM is the outlier. The public runtime error is not a missing Python dependency. It is a compiled-extension failure: the Space downloads its checkpoint, installs a local wheel named diff_gaussian_rasterization-0.0.0-cp310-cp310-linux_x86_64.whl, and then crashes importing that extension because libcudart.so.11.0 is missing. The README already pins python_version: 3.10.13, so Python drift is not the first issue here. The current code also initializes most of the heavy model stack at startup, not lazily. (Hugging Face)
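
The failure can be checked independently of the Space; a diagnostic sketch that reproduces the loader error without importing any LGM code:

import ctypes

# If the CUDA 11 runtime is not on the loader path, CDLL raises OSError with
# the same "cannot open shared object file" message as the Space's crash.
try:
    ctypes.CDLL("libcudart.so.11.0")
    print("CUDA 11 runtime is loadable")
except OSError as e:
    print("the compiled extension would fail here:", e)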

Smallest targeted patch

requirements.txt

 torch==2.4.0
 xformers
 numpy
 tyro
 diffusers
 dearpygui
 einops
 accelerate
 gradio
 imageio
 imageio-ffmpeg
 lpips
 matplotlib
 packaging
 Pillow
 pygltflib
 rembg[gpu,cli]
+nvidia-cuda-runtime-cu11
 rich
 safetensors
 scikit-image
 scikit-learn
 scipy
 tqdm
 transformers
 trimesh
 kiui >= 0.2.3
 xatlas
 roma
 plyfile

app.py
Add this before from core.models import LGM:

+import ctypes
+import os
+import site
+
+# Preload the CUDA 11 runtime shipped by nvidia-cuda-runtime-cu11 so the
+# compiled diff_gaussian_rasterization extension can resolve libcudart.so.11.0.
+for sp in site.getsitepackages():
+    cudart = os.path.join(sp, "nvidia", "cuda_runtime", "lib", "libcudart.so.11.0")
+    if os.path.exists(cudart):
+        ctypes.CDLL(cudart)
+        break

Why this patch is needed even if the trigger was external

Because the current public crash is already precise: the installed compiled extension cannot find libcudart.so.11.0. NVIDIA publishes nvidia-cuda-runtime-cu11 on PyPI as “CUDA Runtime native Libraries,” and this patch preloads the exact library the extension says it is missing before the extension import happens. That is the smallest repo-side change that directly matches the currently exposed failure. A platform-side restart may have exposed it, but once the process restarts, the same binary import will keep failing until the CUDA runtime library problem is addressed. (Hugging Face)

Important honesty note

This is the only one of the four where I would not promise the first patch is enough. It is the smallest targeted fix for the current public error, but native wheels can fail for more than one reason. If the wheel was built against a runtime/ABI combination that still does not match the current Spaces environment, then the next repair is no longer a one-liner. At that point, the smallest real fix becomes either:

  • rebuild that extension for the current runtime, or
  • move the Space to Docker so CUDA and the extension are under your control (a minimal sketch follows after this list).

HF’s current ZeroGPU docs also make clear that ZeroGPU is its own environment with H200-backed shared GPU slices and specific supported versions, so binary assumptions that worked on an older setup can stop being valid after a cold restart. (Hugging Face)
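
For the Docker fallback, a minimal sketch of what that Space could look like. Every value here is an assumption to adapt, especially the CUDA 11.8 base image tag, which has to match whatever the cp310 wheel was actually built against:

# Base image is an assumption: pick the CUDA 11.x tag that matches the wheel.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Ubuntu 22.04 ships Python 3.10, matching the wheel's cp310 tag.
RUN apt-get update && apt-get install -y python3 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

# HF recommends running Docker Spaces as a non-root user with UID 1000.
RUN useradd -m -u 1000 user
USER user
WORKDIR /home/user/app

COPY --chown=user requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY --chown=user . .

# Docker Spaces serve on port 7860 by default (configurable via app_port).
EXPOSE 7860
CMD ["python3", "app.py"]

With this layout you would also set sdk: docker in the README metadata instead of sdk: gradio.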

What I would not do first

I would not start by upgrading Gradio, Torch, or the whole app stack just to chase this one error. The current public failure happens before any of that becomes the main issue: it dies when the compiled rasterizer tries to load _C and cannot find libcudart.so.11.0. Solve the explicit binary import error first. Then, if it boots and another error appears, fix that next one. (Hugging Face)


A compact “do this now” version

If I were patching these repos in the smallest reasonable way, I would do exactly this:

TripoSR

+ onnxruntime

InstantMesh

README.md:
+ python_version: 3.10.13

requirements.txt:
+ onnxruntime

CRM

- "stabilityai/stable-diffusion-2-1-base"
+ "sd2-community/stable-diffusion-2-1-base"

Optional second line:

- default="cuda"
+ default="cuda" if torch.cuda.is_available() else "cpu"

LGM

requirements.txt:
+ nvidia-cuda-runtime-cu11

and preload libcudart.so.11.0 before importing core.models. (Hugging Face)
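
If you prefer to apply the simplest of these patches without the web UI, here is a sketch using huggingface_hub. duplicate_space, hf_hub_download, and upload_file are real huggingface_hub APIs; the TripoSR one-liner is the same diff as above:

from huggingface_hub import HfApi, duplicate_space, hf_hub_download

api = HfApi()  # assumes a write token via HF_TOKEN or `huggingface-cli login`
repo = duplicate_space("stabilityai/TripoSR", exist_ok=True)

# Append the missing dependency to the duplicated Space's requirements.txt.
reqs_path = hf_hub_download(repo.repo_id, "requirements.txt", repo_type="space")
reqs = open(reqs_path).read()
if "onnxruntime" not in reqs:
    api.upload_file(
        path_or_fileobj=(reqs.rstrip() + "\nonnxruntime\n").encode(),
        path_in_repo="requirements.txt",
        repo_id=repo.repo_id,
        repo_type="space",
        commit_message="Add missing onnxruntime dependency",
    )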

Why I think these are the right first patches

Because they match the actual currently exposed startup failures, not a guessed historical failure, and because they keep the diffs small:

  • TripoSR: missing Python package.
  • InstantMesh: same missing package, plus missing Python pin.
  • CRM: dead external repo id.
  • LGM: missing CUDA runtime for a compiled extension. (Hugging Face)
1 Like

These changes lock the environment to an older, known-good state rather than keeping up with the latest changes in the surrounding ecosystem. They are intended solely to resolve the failure of these Spaces to launch.

2 Likes

Thanks for this incredible deep dive, John. That’s exactly the technical post-mortem I was looking for.

I’m going to attempt a local patch (duplicating the Spaces and applying your specific requirements.txt and app.py fixes) to get my project moving again. Your description of ‘latent fragility’ is fascinating: it perfectly explains why these worked for months and then suddenly collapsed during a cold start.

It’s a bit of a wake-up call regarding how fragile the current Image-to-3D stack is, especially with the CRM repo ID issue and the LGM CUDA mismatch. Hopefully the authors see your work and merge these fixes soon so the community can use the main nodes again. Really appreciate the help!

1 Like