HuggingFace Space Keeps Crashing Despite 20% CPU Usage - Need Help!
Summary
My HuggingFace Space keeps crashing and becoming unusable despite successfully reducing CPU usage from 85-100% down to 20% through extensive optimizations. Looking for guidance on what might be causing the crashes.
Space Details
- Space Name: myps
- SDK: Docker
- App Type: Next.js application with AI-powered automation
- Current CPU Usage: ~20% (successfully optimized from 85-100%)
- Issue: Space crashes and becomes completely unusable
What the App Does
- Automated hero battle video generation using Canvas/WebGL
- AI-generated custom battle code: Each hero gets unique AI-generated JavaScript code for their powers and abilities
- Records battles using MediaRecorder (WebM format)
- AI-generated hero characters, names, and backstories
- Uploads videos to YouTube automatically with AI-generated metadata
- Uses Puppeteer for browser automation
- Runs continuous automation cycles
Optimizations Already Implemented
CPU Optimizations (Working - Down to 20%)
- Multi-core CPU affinity: Next.js server pinned to core 0, worker to core 1
- Reduced video resolution: 1080x1920 → 540x960
- Lower frame rate: 30fps → 15fps
- Optimized codec: VP9 → VP8
- Reduced bitrates: 8Mbps → 1.5Mbps video, 128kbps → 64kbps audio
- Game speed increased to 2.5x for faster battles
- Frame skipping in automation mode
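For a rough sense of what the reduced bitrates mean for disk and memory, the sketch below estimates the size of one recording. The helper name and the 60-second duration are assumptions made for illustration, not values from the app, and MediaRecorder treats the configured bitrates as targets rather than guarantees.

```typescript
// Hypothetical helper, not part of the app: estimates the size of one
// recording from the target bitrates listed above.
interface RecordingSettings {
  videoBitsPerSecond: number;
  audioBitsPerSecond: number;
  durationSeconds: number;
}

function estimateRecordingBytes(s: RecordingSettings): number {
  const totalBitsPerSecond = s.videoBitsPerSecond + s.audioBitsPerSecond;
  return (totalBitsPerSecond / 8) * s.durationSeconds;
}

// Assumed 60-second battle; the real duration is not stated in this post.
const optimizedBytes = estimateRecordingBytes({
  videoBitsPerSecond: 1500000, // 1.5Mbps video
  audioBitsPerSecond: 64000, // 64kbps audio
  durationSeconds: 60,
});

console.log(`${(optimizedBytes / 1000000).toFixed(2)} MB per recording`);
```

If recorded chunks are held in memory until upload, or left in /tmp after a failed cycle, this per-recording figure multiplies by the number of cycles.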
Memory Optimizations
- Node.js memory limits: Server 384MB, Worker 256MB, Repair 128MB
- Chrome --max_old_space_size=256 per process, --disable-dev-shm-usage, and memory pressure flags
- Garbage collection enabled with --expose-gc
- Regular cleanup of /tmp files
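One caveat about the limits above: --max-old-space-size caps V8's old-generation heap, not the total memory of a process. Buffers, native allocations, and code live outside that heap. The sketch below shows one way to compare the two numbers. The function name is hypothetical and the sample snapshot is made up. In a real process the snapshot would come from process.memoryUsage().

```typescript
// Same fields as the object returned by Node's process.memoryUsage().
interface MemorySnapshot {
  rss: number; // resident set size: what the container limit is charged for
  heapTotal: number;
  heapUsed: number;
  external: number;
  arrayBuffers: number;
}

const BYTES_PER_MB = 1024 * 1024;

// Hypothetical helper: one log line showing how much resident memory
// sits outside the V8 heap that --max-old-space-size controls.
function summarizeMemory(m: MemorySnapshot): string {
  const toMb = (bytes: number): string => (bytes / BYTES_PER_MB).toFixed(0);
  return `rss=${toMb(m.rss)}MB heapUsed=${toMb(m.heapUsed)}MB outsideHeap=${toMb(m.rss - m.heapTotal)}MB`;
}

// Made-up sample values for illustration only.
const sampleSnapshot: MemorySnapshot = {
  rss: 420 * BYTES_PER_MB,
  heapTotal: 200 * BYTES_PER_MB,
  heapUsed: 180 * BYTES_PER_MB,
  external: 90 * BYTES_PER_MB,
  arrayBuffers: 60 * BYTES_PER_MB,
};

console.log(summarizeMemory(sampleSnapshot));
```

Logging this line periodically from the server and each worker would show whether resident memory grows past the configured heap limits during automation cycles.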
Puppeteer/Chrome Optimizations
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-web-security',
'--disable-features=VizDisplayCompositor',
'--disable-background-timer-throttling',
'--disable-renderer-backgrounding',
'--disable-backgrounding-occluded-windows',
'--disable-ipc-flooding-protection',
'--no-zygote',
'--process-per-site',
'--max_old_space_size=256',
'--memory-pressure-off',
'--autoplay-policy=no-user-gesture-required',
'--use-fake-ui-for-media-stream',
'--window-size=540,960'
]
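A point worth double-checking in the list above: as far as I know, Chrome only forwards V8 options that are passed through --js-flags, so the standalone --max_old_space_size=256 entry may have no effect. A minimal sketch of the alternative form follows. The function name is hypothetical and only a few of the flags are repeated.

```typescript
// Hypothetical helper: builds a reduced argument list in which the V8 heap
// cap is passed through --js-flags instead of as a top-level Chrome switch.
function buildChromeArgs(v8HeapMb: number): string[] {
  return [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
    `--js-flags=--max-old-space-size=${v8HeapMb}`,
    "--window-size=540,960",
  ];
}

const chromeArgs = buildChromeArgs(256);
console.log(chromeArgs.join(" "));
```

The resulting array would be passed as the args option of the Puppeteer launch call. Even with this flag, the cap applies to the JavaScript heap of each renderer, not to Chrome's total memory.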
Process Management
- Active watchdog script monitoring all processes
- 15-minute worker timeout with forced restart
- Automatic process restart on crashes
- Hourly resource cleanup
- Memory and disk usage monitoring
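For the memory monitoring step, one option is to sum resident memory across every process in the container, including the Chrome processes that Puppeteer spawns. The sketch below parses the text produced by ps -eo pid,rss,comm. The function names are hypothetical and the sample text is made up for illustration, not taken from the Space.

```typescript
interface ProcRow {
  pid: number;
  rssKb: number; // ps reports RSS in KiB
  command: string;
}

// Parses the output of `ps -eo pid,rss,comm`, skipping the header line.
function parsePsOutput(text: string): ProcRow[] {
  return text
    .split("\n")
    .slice(1)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const [pid, rss, ...rest] = line.split(/\s+/);
      return { pid: Number(pid), rssKb: Number(rss), command: rest.join(" ") };
    });
}

function totalRssMb(rows: ProcRow[]): number {
  return rows.reduce((sum, row) => sum + row.rssKb, 0) / 1024;
}

// Made-up sample output for illustration only.
const samplePs = [
  "  PID    RSS COMMAND",
  "   14 204800 next-server",
  "   27 153600 node",
  "   31  51200 node",
  "   58 307200 chrome",
  "   73 262144 chrome",
].join("\n");

const psRows = parsePsOutput(samplePs);
console.log(`${psRows.length} processes, ${totalRssMb(psRows)} MB resident in total`);
```

Summing RSS overcounts pages shared between processes, so this is an upper bound, but it does capture the Chrome processes that the Node.js limits leave out.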
Current Monitoring Output
[WATCHDOG] === CPU Core Utilization Monitor ===
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
[WATCHDOG] Process distribution:
14 0 next-server next-server (v15.3.3)
27 1 node node --max-old-space-size=256 dist/worker.js
31 3 node node --max-old-space-size=128 dist/repair.worker.js
The Problem
Despite achieving 20% CPU usage (a massive improvement!), the Space still crashes and becomes unusable. The processes are properly distributed across CPU cores, and memory usage appears controlled.
Questions for HuggingFace Support
- What could cause crashes at 20% CPU usage? Are there other resource limits I'm hitting?
- Memory limits on free tier? What's the actual RAM limit? My processes use 384MB + 256MB + 128MB = ~768MB total.
- Disk space limits? Could /tmp disk usage cause crashes? I'm cleaning up regularly.
- Network/API rate limiting? Could YouTube API calls or external requests trigger space suspension?
- Docker resource limits? Are there container limits I'm not aware of?
- Process limits? Am I running too many Node.js processes simultaneously?
- WebM video recording issues? Could MediaRecorder cause crashes even at low bitrates?
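Some of the memory and container-limit questions above might be answerable from inside the container. The sketch below assumes the container uses cgroup v2 and exposes /sys/fs/cgroup/memory.max and /sys/fs/cgroup/memory.events, which I have not confirmed for Spaces. The function names are hypothetical and the sample file contents are made up.

```typescript
// memory.max holds either a byte count or the literal string "max" (no limit).
function parseMemoryMax(text: string): number | null {
  const value = text.trim();
  return value === "max" ? null : Number(value);
}

// memory.events is a list of "key value" lines; oom_kill counts processes
// in this cgroup that were killed by the kernel's out-of-memory killer.
function parseOomKills(text: string): number {
  for (const line of text.split("\n")) {
    const [key, value] = line.trim().split(/\s+/);
    if (key === "oom_kill") return Number(value);
  }
  return 0;
}

// Made-up sample contents; real values would be read from the files above,
// for example with fs.readFileSync(path, "utf8").
const sampleMemoryMax = "2147483648\n";
const sampleMemoryEvents = "low 0\nhigh 0\nmax 12\noom 3\noom_kill 3\n";

console.log(`limit: ${parseMemoryMax(sampleMemoryMax)} bytes`);
console.log(`oom kills so far: ${parseOomKills(sampleMemoryEvents)}`);
```

An oom_kill counter that increases between watchdog checks would point to memory rather than CPU as the cause of the crashes.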
Code Repository Structure
├── start.sh (Watchdog script with multi-core affinity)
├── worker.ts (Puppeteer automation with optimized Chrome)
├── components/BattleArena.tsx (Canvas rendering + MediaRecorder)
├── lib/BroadcastUI.ts (Optimized UI rendering)
├── Dockerfile (Next.js Docker setup)
└── package.json (Dependencies)
What I've Tried
- Reduced worker timeout from 30min to 15min
- Added aggressive resource cleanup every hour
- Implemented proper browser process cleanup
- Added memory monitoring and garbage collection
- Used taskset for CPU core affinity
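On the browser cleanup item, a common pattern is to close the browser in a finally block so that a failed cycle cannot leave Chrome processes behind. The sketch below is generic and is not the code in worker.ts. The names withResource and Closable are hypothetical, and it only assumes an object with an async close() method, which Puppeteer's Browser provides.

```typescript
interface Closable {
  close(): Promise<void>;
}

// Opens a resource, runs the task, and always closes the resource,
// even when the task throws.
async function withResource<T extends Closable, R>(
  open: () => Promise<T>,
  use: (resource: T) => Promise<R>,
): Promise<R> {
  const resource = await open();
  try {
    return await use(resource);
  } finally {
    await resource.close();
  }
}

// Usage with Puppeteer would look roughly like this (not executed here):
//   await withResource(
//     () => puppeteer.launch({ args }),
//     async (browser) => { /* record one battle */ },
//   );
```

If Chrome itself hangs, close() may never resolve, so a watchdog that kills leftover Chrome processes by PID is still useful as a fallback.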
Expected vs Actual
- Expected: Stable operation at 20% CPU
- Actual: Space crashes and becomes unusable despite low CPU
Request
Could someone from HuggingFace support help identify what resource limit I'm hitting? The CPU optimization worked perfectly, but something else is causing the crashes.
Is there a way to get detailed crash logs or resource usage reports to debug this further?
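Until detailed logs are available, the watchdog could at least record how each process ended. Node's child process exit event reports an exit code and a signal, and a SIGKILL (or shell exit code 137, which is 128 + 9) is what a kernel out-of-memory kill looks like from the outside. The function name below is hypothetical.

```typescript
// Hypothetical helper: turns the (code, signal) pair from a child process
// "exit" event into a log-friendly description.
function describeExit(code: number | null, signal: string | null): string {
  if (signal === "SIGKILL" || code === 137) {
    return "killed with SIGKILL (possible out-of-memory kill)";
  }
  if (signal !== null) {
    return `terminated by ${signal}`;
  }
  if (code === 0) {
    return "exited normally";
  }
  return `exited with code ${code}`;
}

console.log(describeExit(null, "SIGKILL"));
console.log(describeExit(1, null));
```

A SIGKILL is not proof of an out-of-memory kill, since the watchdog's own forced restarts may use the same signal, so the log line should be read together with the timing of the 15-minute timeout.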
Additional Info
- Space worked fine during testing phases
- Crashes occur during automated cycles
- Manual operations work correctly
- Problem persists after all optimizations
Thank you for any guidance!
Tags: Spaces #docker #nextjs #puppeteer #automation #crashes #cpu-optimization #memory-limits