Based on the issues I identified while modifying my own Spaces, the following are the minimum necessary concrete fixes:
All four currently show Runtime error, but they do not fail for one shared code reason. What happened is closer to this: a platform-side event likely forced cold starts, and those cold starts exposed four different latent startup problems. So the right repair strategy is not "apply one global workaround," but "make each repo boot cleanly in the current Spaces environment with the smallest justified diff." The four buckets are: missing onnxruntime after rembg import for TripoSR and InstantMesh, a dead upstream repo id for CRM, and a native CUDA runtime mismatch for LGM. HF's current Spaces config still supports explicit python_version pinning, and current ZeroGPU docs list 3.10.13 and 3.12.12 as supported Python versions, so version pinning is still part of the stabilization story. (Hugging Face)
The key principle is this: even if the trigger was a restart, unpause, rebuild, scheduler issue, or temporary API problem, these fixes are still needed because each current repo has a deterministic startup failure in its own code path. In other words, even if the platform caused the failure to become visible, the repo still has to be made bootable. That is why I would keep the patches minimal and specific instead of doing large framework upgrades first. pip has become stricter over time, but none of the four currently exposed failures are primarily "requirements syntax" bugs. They are startup dependency and binary/runtime mismatches. (pip)
What I would change first, in order
- TripoSR: add onnxruntime.
- InstantMesh: add onnxruntime, and pin python_version: 3.10.13 in README.
- CRM: replace the dead stabilityai/stable-diffusion-2-1-base scheduler source with sd2-community/stable-diffusion-2-1-base.
- LGM: add nvidia-cuda-runtime-cu11, then preload libcudart.so.11.0 before importing the compiled extension. This is the smallest targeted fix for the current public crash, but LGM is the only one where I would keep an explicit fallback plan in mind if the first patch is not enough. (Hugging Face)
1) TripoSR
Why it crashes now
The current app imports rembg at module import time, and the current requirements.txt includes bare rembg but not onnxruntime. The public runtime traceback for this Space shows exactly that failure path: import rembg → import onnxruntime as ort → ModuleNotFoundError: No module named 'onnxruntime'. The README already pins python_version: 3.10.13, so Python drift is not the first thing to fix here. (Hugging Face)
Smallest patch
requirements.txt
omegaconf==2.3.0
Pillow==10.1.0
einops==0.7.0
transformers==4.35.0
trimesh==4.0.5
rembg
+onnxruntime
huggingface-hub
gradio
Why this patch is needed even if the trigger was external
Because the failure is deterministic at startup. The current repo asks Python to import rembg before the app can even finish importing, and the public crash shows that the installed environment does not contain onnxruntime. A platform restart may have exposed it, but a clean cold start will keep hitting the same line until onnxruntime is present. This is why I would not start by upgrading Gradio or Torch here. The smallest repair is to add the missing package that the current code path actually imports. (Hugging Face)
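Because the failure is a deterministic module-level import, it can be checked before a deploy rather than discovered in the Space's runtime log. A minimal preflight sketch (the helper name missing_startup_deps is mine, not something in the repo):

```python
import importlib.util

def missing_startup_deps(required=("onnxruntime", "rembg")):
    """Return the names in `required` that cannot be resolved to an
    installed module, i.e. the imports that will crash a cold start."""
    return [name for name in required if importlib.util.find_spec(name) is None]
```

Run in the Space's environment, an empty list means the module-level imports in app.py will at least resolve; with the current requirements.txt, "onnxruntime" would appear in the list, matching the public traceback.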
Why I am not making a bigger first patch
You could switch to a newer rembg extra layout, but that is not the smallest safe move for this repo. The exposed failure is not "wrong Gradio API," not "wrong Torch version," and not "wrong Python version." It is specifically "onnxruntime is missing." So the one-line fix above is the cleanest first pass. (Hugging Face)
2) InstantMesh
Why it crashes now
This Space has the same primary failure as TripoSR. app.py imports rembg, and the preprocessing path creates a rembg session. requirements.txt still lists bare rembg, and the public runtime traceback again shows ModuleNotFoundError: No module named 'onnxruntime'. Unlike TripoSR, its README metadata does not currently specify python_version, even though HF supports pinning it in README YAML. (Hugging Face)
Smallest patch
README.md
title: InstantMesh
emoji:
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 4.26.0
+python_version: 3.10.13
app_file: app.py
pinned: false
short_description: Create a 3D model from an image in 10 seconds!
license: apache-2.0
requirements.txt
torch==2.1.0
torchvision==0.16.0
torchaudio==2.1.0
pytorch-lightning==2.1.2
einops
omegaconf
deepspeed
torchmetrics
webdataset
accelerate
tensorboard
PyMCubes
trimesh
rembg
+onnxruntime
transformers==4.34.1
diffusers==0.19.3
bitsandbytes
imageio[ffmpeg]
xatlas
plyfile
xformers==0.0.22.post7
git+https://github.com/NVlabs/nvdiffrast/
huggingface-hub
Why this patch is needed even if the trigger was external
Again, because the current startup path already contains the failure. The app imports rembg before the UI is ready, and the publicly reported runtime failure is the missing onnxruntime import. The Python pin is a separate hardening step: HF lets Spaces pin python_version, and current ZeroGPU docs explicitly list 3.10.13 as supported. Even if the platform restart is what made the breakage visible, keeping Python fixed removes one more moving part from future cold starts. (Hugging Face)
What I would not do first
I would not begin by mass-upgrading the whole dependency stack. There is already a community PR that proposes a larger cleanup including numpy<2.0.0, Pillow==10.4.0, newer gradio, and simplified requirements. That may be useful later, but the smallest justified first repair is still "add onnxruntime and pin Python." (Hugging Face)
If the first patch boots but then fails later
The next smallest hardening step is:
+numpy<2.0.0
+Pillow==10.4.0
I would only do that after confirming that the startup blocker moved past rembg/onnxruntime. The reason is simple: fix the deterministic boot failure first, then deal with second-order runtime drift. (Hugging Face)
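If that hardening step ever becomes necessary, it helps to know what a cold start actually installed. A small sketch for logging the drift-prone versions at the top of app.py (the helper name report_versions is mine):

```python
import importlib.metadata

def report_versions(packages=("numpy", "Pillow", "gradio", "rembg")):
    """Map each distribution name to its installed version, or None if
    absent, so the Space's build log records what the resolver picked."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Printing this dict once at startup costs nothing and turns "second-order runtime drift" from a guess into a diffable line in the log.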
3) CRM
Why it crashes now
The current public runtime error is very specific. The Space tries to build a DDIMScheduler from stabilityai/stable-diffusion-2-1-base, and that repo id no longer resolves publicly for the needed scheduler config. The crash trace points into model/crm/model.py at the scheduler initialization. At the same time, app.py defaults --device to "cuda" and moves the model there during startup, which is an additional fragility point once the scheduler problem is fixed. (Hugging Face)
Smallest patch
model/crm/model.py
-self.scheduler = DDIMScheduler.from_pretrained(
- "stabilityai/stable-diffusion-2-1-base",
- subfolder="scheduler",
-)
+self.scheduler = DDIMScheduler.from_pretrained(
+ "sd2-community/stable-diffusion-2-1-base",
+ subfolder="scheduler",
+)
Why this patch is needed even if the trigger was external
Because the current repo points at a model id that no longer works for this code path, and the public crash trace shows exactly that path failing. sd2-community/stable-diffusion-2-1-base exists, and its repo contains scheduler/scheduler_config.json, which is the file CRM is trying to load. So this is not a speculative change. It is a direct one-line replacement for the dead dependency that the current startup path is trying to read. Even if a platform restart is what surfaced the error, any future cold start will keep failing until the repo id is replaced. (Hugging Face)
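If slightly more resilience than a hard swap is wanted, the same one-line change can be expressed as an ordered fallback. A hedged sketch (the helper and the ordering are my suggestion, not CRM code; `loader` would be something like `lambda rid: DDIMScheduler.from_pretrained(rid, subfolder="scheduler")`):

```python
def load_from_first_working_source(sources, loader):
    """Try each repo id in order and return the first successful load.
    `loader` performs the actual fetch for a given repo id."""
    last_err = None
    for repo_id in sources:
        try:
            return loader(repo_id)
        except Exception as err:  # hub error types vary by diffusers version
            last_err = err
    raise RuntimeError(f"no scheduler source in {sources!r} resolved") from last_err

# Mirror first, because the original id is currently dead:
SCHEDULER_SOURCES = (
    "sd2-community/stable-diffusion-2-1-base",
    "stabilityai/stable-diffusion-2-1-base",
)
```

The plain one-line swap remains the smaller diff; this variant only buys anything if the original repo id ever comes back.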
Optional but very cheap second line
app.py
-parser.add_argument("--device", type=str, default="cuda")
+parser.add_argument("--device", type=str, default="cuda" if torch.cuda.is_available() else "cpu")
Why I would add that second line
The scheduler fix is the primary repair. But once the app gets past that point, startup still does model = model.to(args.device) and passes device=args.device into the pipeline constructor. Right now that default is hard-coded to "cuda". So if the Space is restarted on a CPU-backed environment, or on a GPU path that is temporarily unavailable, the next boot can fail later in startup. That one-line default makes the app more robust without changing its interface or behavior when CUDA is actually available. (Hugging Face)
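The same idea can be pulled into a tiny helper so every call site that touches args.device shares the fallback. A sketch (resolve_device is my name; it assumes torch is installed on the Space, which CRM's requirements already guarantee):

```python
def resolve_device(requested=None):
    """Return an explicit device string, falling back to CPU when CUDA
    (or torch itself) is unavailable, instead of hard-coding "cuda"."""
    if requested:
        return requested
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:  # torch missing: only plausible outside the Space
        return "cpu"
```

With this, the argparse default becomes `default=resolve_device()` and the behavior on a CUDA-backed boot is unchanged.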
What I would not do first
I would not start by adding tokens or auth logic. The current public problem is not "this repo is gated but otherwise correct." The practical issue is that the code points at a repo id that no longer works for the scheduler path, and a community mirror already exposes the file CRM needs. So the smallest valid fix is to swap the source, not to add authentication plumbing. (Hugging Face)
4) LGM
Why it crashes now
LGM is the outlier. The public runtime error is not a missing Python dependency. It is a compiled-extension failure: the Space downloads its checkpoint, installs a local wheel named diff_gaussian_rasterization-0.0.0-cp310-cp310-linux_x86_64.whl, and then crashes importing that extension because libcudart.so.11.0 is missing. The README already pins python_version: 3.10.13, so Python drift is not the first issue here. The current code also initializes most of the heavy model stack at startup, not lazily. (Hugging Face)
Smallest targeted patch
requirements.txt
torch==2.4.0
xformers
numpy
tyro
diffusers
dearpygui
einops
accelerate
gradio
imageio
imageio-ffmpeg
lpips
matplotlib
packaging
Pillow
pygltflib
rembg[gpu,cli]
+nvidia-cuda-runtime-cu11
rich
safetensors
scikit-image
scikit-learn
scipy
tqdm
transformers
trimesh
kiui >= 0.2.3
xatlas
roma
plyfile
app.py
Add this before from core.models import LGM:
+import ctypes
+import os
+import site
+
+# Preload the CUDA runtime shipped by nvidia-cuda-runtime-cu11 so the
+# compiled rasterizer can resolve libcudart.so.11.0 at import time.
+for sp in site.getsitepackages():
+    cudart = os.path.join(sp, "nvidia", "cuda_runtime", "lib", "libcudart.so.11.0")
+    if os.path.exists(cudart):
+        ctypes.CDLL(cudart)
+        break
Why this patch is needed even if the trigger was external
Because the current public crash is already precise: the installed compiled extension cannot find libcudart.so.11.0. NVIDIA publishes nvidia-cuda-runtime-cu11 on PyPI as "CUDA Runtime native Libraries," and this patch preloads the exact library the extension says it is missing before the extension import happens. That is the smallest repo-side change that directly matches the currently exposed failure. A platform-side restart may have exposed it, but once the process restarts, the same binary import will keep failing until the CUDA runtime library problem is addressed. (Hugging Face)
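The preload can also be written as a small reusable helper that reports whether anything was actually loaded, which makes the build log clearer when the search path changes between environments. A sketch (preload_shared_lib is my name, not LGM code):

```python
import ctypes
import os
import site

def preload_shared_lib(relative_parts, soname):
    """Search every site-packages root for `soname` under the given
    subdirectory parts, dlopen the first match, and return its path
    (or None if nothing matched)."""
    for root in site.getsitepackages():
        candidate = os.path.join(root, *relative_parts, soname)
        if os.path.exists(candidate):
            ctypes.CDLL(candidate)  # stays resident for later extension imports
            return candidate
    return None

# Intended call in app.py, before `from core.models import LGM`:
# preload_shared_lib(("nvidia", "cuda_runtime", "lib"), "libcudart.so.11.0")
```

Logging the return value turns a silent no-op into an explicit "libcudart was (not) preloaded" line, which is exactly the evidence needed if the first patch turns out not to be enough.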
Important honesty note
This is the only one of the four where I would not promise the first patch is enough. It is the smallest targeted fix for the current public error, but native wheels can fail for more than one reason. If the wheel was built against a runtime/ABI combination that still does not match the current Spaces environment, then the next repair is no longer a one-liner. At that point, the smallest real fix becomes either:
- rebuild that extension for the current runtime, or
- move the Space to Docker so CUDA and the extension are under your control.
HF's current ZeroGPU docs also make clear that ZeroGPU is its own environment with H200-backed shared GPU slices and specific supported versions, so binary assumptions that worked on an older setup can stop being valid after a cold restart. (Hugging Face)
What I would not do first
I would not start by upgrading Gradio, Torch, or the whole app stack just to chase this one error. The current public failure happens before any of that becomes the main issue: it dies when the compiled rasterizer tries to load _C and cannot find libcudart.so.11.0. Solve the explicit binary import error first. Then, if it boots and another error appears, fix that next one. (Hugging Face)
A compact "do this now" version
If I were patching these repos in the smallest reasonable way, I would do exactly this:
TripoSR
+ onnxruntime
InstantMesh
README.md:
+ python_version: 3.10.13
requirements.txt:
+ onnxruntime
CRM
- "stabilityai/stable-diffusion-2-1-base"
+ "sd2-community/stable-diffusion-2-1-base"
Optional second line:
- default="cuda"
+ default="cuda" if torch.cuda.is_available() else "cpu"
LGM
requirements.txt:
+ nvidia-cuda-runtime-cu11
and preload libcudart.so.11.0 before importing core.models. (Hugging Face)
Why I think these are the right first patches
Because they match the actual currently exposed startup failures, not a guessed historical failure, and because they keep the diffs small:
- TripoSR: missing Python package.
- InstantMesh: same missing package, plus missing Python pin.
- CRM: dead external repo id.
- LGM: missing CUDA runtime for a compiled extension. (Hugging Face)