HuggingFace Space Keeps Crashing Despite 20% CPU Usage - Need Help!

bonesmasher · July 2, 2025, 7:38am

Summary

My HuggingFace Space keeps crashing and becoming unusable despite successfully reducing CPU usage from 85-100% down to 20% through extensive optimizations. Looking for guidance on what might be causing the crashes.

Space Details

Space Name: myps
SDK: Docker
App Type: Next.js application with AI-powered automation
Current CPU Usage: ~20% (successfully optimized from 85-100%)
Issue: Space crashes and becomes completely unusable

What the App Does

Automated hero battle video generation using Canvas/WebGL
AI-generated custom battle code: Each hero gets unique AI-generated JavaScript code for their powers and abilities
Records battles using MediaRecorder (WebM format)
AI-generated hero characters, names, and backstories
Uploads videos to YouTube automatically with AI-generated metadata
Uses Puppeteer for browser automation
Runs continuous automation cycles

Optimizations Already Implemented

CPU Optimizations (Working - Down to 20%)

Multi-core CPU affinity: Next.js server pinned to core 0, worker to core 1
Reduced video resolution: 1080x1920 → 540x960
Lower frame rate: 30fps → 15fps
Optimized codec: VP9 → VP8
Reduced bitrates: 8Mbps → 1.5Mbps video, 128kbps → 64kbps audio
Game speed increased to 2.5x for faster battles
Frame skipping in automation mode
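
To put the bitrate reduction in perspective, here is a rough sketch of how much data one minute of recording produces at the old and new settings. It assumes a constant bitrate (real VP8/VP9 output varies), and the helper name estimateRecordingBytes is hypothetical, not something from the app. If the recorded chunks are held in memory until upload (an assumption about how BattleArena.tsx buffers them), this is also roughly what each recording costs in RAM.

```typescript
// Rough size estimate for a recording at a constant bitrate.
// Real encoders produce variable output, so treat this as an upper-level guide only.
function estimateRecordingBytes(
  videoBitsPerSecond: number,
  audioBitsPerSecond: number,
  durationSeconds: number,
): number {
  const totalBitsPerSecond = videoBitsPerSecond + audioBitsPerSecond;
  return (totalBitsPerSecond / 8) * durationSeconds; // bits -> bytes
}

// Settings taken from the list above: 8Mbps/128kbps before, 1.5Mbps/64kbps after.
const before = estimateRecordingBytes(8_000_000, 128_000, 60);
const after = estimateRecordingBytes(1_500_000, 64_000, 60);

console.log(`before: ${(before / 1e6).toFixed(2)} MB per minute`);
console.log(`after:  ${(after / 1e6).toFixed(2)} MB per minute`);
```

Under these assumptions the estimate drops from about 60.96 MB to about 11.73 MB per minute of video, so the recording itself is unlikely to be what exhausts memory unless many recordings pile up without being released.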

Memory Optimizations

Node.js memory limits: Server 384MB, Worker 256MB, Repair 128MB
Chrome V8 heap limit: 256MB per process
--disable-dev-shm-usage and memory pressure flags
Garbage collection enabled with --expose-gc
Regular cleanup of /tmp files
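
One caveat about the limits above: --max-old-space-size caps only V8's old-generation heap, not the total memory a process occupies. Buffers, native allocations, and code all sit outside that cap. The sketch below logs resident set size (RSS) next to heap usage using Node's process.memoryUsage(); the helper name formatMemoryUsage is hypothetical, and this is only one way such a check might look.

```typescript
// Format the fields of process.memoryUsage() that matter when comparing
// the V8 heap cap against what the process really occupies.
function formatMemoryUsage(usage: {
  rss: number;
  heapUsed: number;
  external: number;
}): string {
  const mb = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);
  return (
    `rss=${mb(usage.rss)}MB ` +
    `heapUsed=${mb(usage.heapUsed)}MB ` +
    `external=${mb(usage.external)}MB`
  );
}

// rss is the whole process footprint; heapUsed is what the V8 limit governs.
console.log(formatMemoryUsage(process.memoryUsage()));
```

If rss is consistently far above heapUsed, the configured heap limits are not a reliable estimate of total memory, and the Chrome processes launched by Puppeteer are not counted here at all.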

Puppeteer/Chrome Optimizations

args: [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-web-security',
  '--disable-features=VizDisplayCompositor',
  '--disable-background-timer-throttling',
  '--disable-renderer-backgrounding',
  '--disable-backgrounding-occluded-windows',
  '--disable-ipc-flooding-protection',
  '--no-zygote',
  '--process-per-site',
  '--js-flags=--max-old-space-size=256', // V8 heap flags reach Chrome only through --js-flags
  '--memory-pressure-off',
  '--autoplay-policy=no-user-gesture-required',
  '--use-fake-ui-for-media-stream',
  '--window-size=540,960'
]

Process Management

Active watchdog script monitoring all processes
15-minute worker timeout with forced restart
Automatic process restart on crashes
Hourly resource cleanup
Memory and disk usage monitoring
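
For the memory monitoring item, a watchdog can also read what the container itself reports rather than summing per-process numbers. The sketch below assumes the standard Linux cgroup files are readable inside the Space's container, which has not been verified for Hugging Face Spaces; the function names are hypothetical.

```typescript
import { readFileSync } from "node:fs";

// cgroup v2 writes the literal string "max" when no limit is set.
function parseCgroupValue(raw: string): number | null {
  const trimmed = raw.trim();
  if (trimmed === "max" || trimmed === "") return null;
  const value = Number(trimmed);
  return Number.isFinite(value) ? value : null;
}

// Return the parsed value from the first candidate path that can be read.
function readCgroupValue(paths: string[]): number | null {
  for (const path of paths) {
    try {
      return parseCgroupValue(readFileSync(path, "utf8"));
    } catch {
      // File missing or unreadable: try the next candidate path.
    }
  }
  return null;
}

const limit = readCgroupValue([
  "/sys/fs/cgroup/memory.max", // cgroup v2
  "/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
]);
const usage = readCgroupValue([
  "/sys/fs/cgroup/memory.current", // cgroup v2
  "/sys/fs/cgroup/memory/memory.usage_in_bytes", // cgroup v1
]);

// null means "no limit set" or "file not available in this environment".
console.log("[WATCHDOG] container memory", { limit, usage });
```

Logging usage against limit on each watchdog pass would show whether memory climbs toward the container limit in the minutes before a crash.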

Current Monitoring Output

[WATCHDOG] === CPU Core Utilization Monitor ===
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
[WATCHDOG] Process distribution:
    14   0 next-server    next-server (v15.3.3)
    27   1 node           node --max-old-space-size=256 dist/worker.js
    31   3 node           node --max-old-space-size=128 dist/repair.worker.js

The Problem

Despite achieving 20% CPU usage (a massive improvement!), the space still crashes and becomes unusable. The processes are properly distributed across CPU cores and memory usage appears controlled.

Questions for HuggingFace Support

1. What could cause crashes at 20% CPU usage? Are there other resource limits I’m hitting?
2. Memory limits on free tier? What’s the actual RAM limit? My processes use 384MB + 256MB + 128MB = ~768MB total.
3. Disk space limits? Could /tmp disk usage cause crashes? I’m cleaning up regularly.
4. Network/API rate limiting? Could YouTube API calls or external requests trigger space suspension?
5. Docker resource limits? Are there container limits I’m not aware of?
6. Process limits? Am I running too many Node.js processes simultaneously?
7. WebM video recording issues? Could MediaRecorder cause crashes even at low bitrates?

Code Repository Structure

├── start.sh (Watchdog script with multi-core affinity)
├── worker.ts (Puppeteer automation with optimized Chrome)
├── components/BattleArena.tsx (Canvas rendering + MediaRecorder)
├── lib/BroadcastUI.ts (Optimized UI rendering)
├── Dockerfile (Next.js Docker setup)
└── package.json (Dependencies)

What I’ve Tried

Reduced worker timeout from 30min to 15min
Added aggressive resource cleanup every hour
Implemented proper browser process cleanup
Added memory monitoring and garbage collection
Used taskset for CPU core affinity

Expected vs Actual

Expected: Stable operation at 20% CPU
Actual: Space crashes and becomes unusable despite low CPU

Request

Could someone from HuggingFace support help identify what resource limit I’m hitting? The CPU optimization worked perfectly, but something else is causing the crashes.

Is there a way to get detailed crash logs or resource usage reports to debug this further?
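
In the meantime, one way to gather evidence from inside the container is to record how each child process ends. Node's child_process 'exit' event reports an exit code and a signal; a SIGKILL (or exit code 137, which is 128 + 9 as reported by shells) is commonly what the kernel OOM killer produces, although the watchdog's own forced restarts would look the same if they use kill -9. The helper below is a hypothetical sketch, not part of the existing start.sh or worker.ts.

```typescript
// Interpret the (code, signal) pair reported when a child process exits.
function describeExit(code: number | null, signal: string | null): string {
  if (signal === "SIGKILL" || code === 137) {
    return "killed with SIGKILL (possibly the kernel OOM killer or a container limit)";
  }
  if (signal !== null) {
    return `terminated by signal ${signal}`;
  }
  if (code === 0) {
    return "exited cleanly";
  }
  return `exited with code ${code}`;
}

// Example values a watchdog might see.
console.log("[WATCHDOG]", describeExit(null, "SIGKILL"));
console.log("[WATCHDOG]", describeExit(1, null));
```

Attached to a spawned worker with child.on("exit", (code, signal) => ...), this would distinguish ordinary application errors from kills that come from outside the process.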

Additional Info

Space worked fine during testing phases
Crashes occur during automated cycles
Manual operations work correctly
Problem persists after all optimizations

Thank you for any guidance!

Tags: Spaces #docker #nextjs #puppeteer #automation #crashes #cpu-optimization #memory-limits

John6666 · July 2, 2025, 9:55am

I’m not support, but…
I think crashes in Hugging Face’s virtual environment are more often caused by RAM being used up than by CPU overload.

1. Unless there is a very unusual command involved, I don’t think it would crash at a CPU usage rate of 20%.
2. (On the free tier,) 16GB RAM + 50GB SSD for swapping per Space.
3. 50GB as a whole per Space.
5. Some programs, such as reverse proxies, are prohibited, but in that case I think the build would fail rather than crash.

bonesmasher · July 3, 2025, 3:29pm

RAM usage only reaches 2-3GB at the very most, and only occasionally, and I clean up storage after each upload.
