HuggingFace Space Keeps Crashing Despite 20% CPU Usage - Need Help!

Summary

My HuggingFace Space keeps crashing and becoming unusable despite successfully reducing CPU usage from 85-100% down to 20% through extensive optimizations. Looking for guidance on what might be causing the crashes.

Space Details

  • Space Name: myps
  • SDK: Docker
  • App Type: Next.js application with AI-powered automation
  • Current CPU Usage: ~20% (successfully optimized from 85-100%)
  • Issue: Space crashes and becomes completely unusable

What the App Does

  • Automated hero battle video generation using Canvas/WebGL
  • AI-generated custom battle code: Each hero gets unique AI-generated JavaScript code for their powers and abilities
  • Records battles using MediaRecorder (WebM format)
  • AI-generated hero characters, names, and backstories
  • Uploads videos to YouTube automatically with AI-generated metadata
  • Uses Puppeteer for browser automation
  • Runs continuous automation cycles

Optimizations Already Implemented

:white_check_mark: CPU Optimizations (Working - Down to 20%)

  • Multi-core CPU affinity: Next.js server pinned to core 0, worker to core 1
  • Reduced video resolution: 1080x1920 → 540x960
  • Lower frame rate: 30fps → 15fps
  • Optimized codec: VP9 → VP8
  • Reduced bitrates: 8Mbps → 1.5Mbps video, 128kbps → 64kbps audio
  • Game speed increased to 2.5x for faster battles
  • Frame skipping in automation mode
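To put the recording settings above in numbers, here is a minimal sketch that collects them in the shape of the standard MediaRecorder options dictionary and estimates the encoded size of one battle. The names `recorderSettings` and `estimateRecordingMb` are hypothetical, and the bitrates are encoder targets rather than hard caps, so the result is only a rough figure.

```typescript
// Settings taken from the list above; the field names match the standard
// MediaRecorder options (mimeType, videoBitsPerSecond, audioBitsPerSecond).
interface RecorderSettings {
  mimeType: string;
  videoBitsPerSecond: number;
  audioBitsPerSecond: number;
}

// Hypothetical helper returning the post-optimization values.
function recorderSettings(): RecorderSettings {
  return {
    mimeType: "video/webm;codecs=vp8",
    videoBitsPerSecond: 1500000, // 1.5 Mbps
    audioBitsPerSecond: 64000,   // 64 kbps
  };
}

// Rough estimate of the encoded size (in MB) for a recording of the given length.
function estimateRecordingMb(seconds: number, settings: RecorderSettings): number {
  const bits = (settings.videoBitsPerSecond + settings.audioBitsPerSecond) * seconds;
  return bits / 8 / 1000000;
}

console.log(`~${estimateRecordingMb(60, recorderSettings()).toFixed(2)} MB per minute`);
```

If the recorder is started without a timeslice, the browser buffers the whole recording until it stops, so this figure is also a rough guide to how much the page holds in memory per battle.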

:white_check_mark: Memory Optimizations

  • Node.js memory limits: Server 384MB, Worker 256MB, Repair 128MB
  • Chrome --max_old_space_size=256 per process
  • --disable-dev-shm-usage and memory pressure flags
  • Garbage collection enabled with --expose-gc
  • Regular cleanup of /tmp files
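One caveat about the limits above: `--max-old-space-size` caps only V8's old-generation heap, not a process's total resident memory (RSS), which also includes Buffers and other native allocations. A minimal sketch, assuming a hypothetical `logMemory` helper called periodically from each Node.js process, that logs both figures using Node's built-in `process.memoryUsage()`:

```typescript
// Hypothetical helper (not from the original app): report resident memory
// next to V8 heap usage so both can be compared against the configured caps.
const BYTES_PER_MB = 1024 * 1024;

interface MemorySnapshot {
  rssMb: number;      // total resident set size of the process
  heapUsedMb: number; // V8 heap currently in use
  externalMb: number; // memory of C++ objects bound to JS objects
}

// Convert raw byte counts to megabytes, rounded to one decimal place.
function toSnapshot(usage: { rss: number; heapUsed: number; external: number }): MemorySnapshot {
  const toMb = (bytes: number) => Math.round((bytes / BYTES_PER_MB) * 10) / 10;
  return {
    rssMb: toMb(usage.rss),
    heapUsedMb: toMb(usage.heapUsed),
    externalMb: toMb(usage.external),
  };
}

function logMemory(label: string): MemorySnapshot {
  const snapshot = toSnapshot(process.memoryUsage());
  console.log(
    `[MEM] ${label} rss=${snapshot.rssMb}MB heapUsed=${snapshot.heapUsedMb}MB external=${snapshot.externalMb}MB`,
  );
  return snapshot;
}
```

If `rss` sits well above the heap cap, the process is using more memory than the 384MB/256MB/128MB figures suggest.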

:white_check_mark: Puppeteer/Chrome Optimizations

args: [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-web-security',
  '--disable-features=VizDisplayCompositor',
  '--disable-background-timer-throttling',
  '--disable-renderer-backgrounding',
  '--disable-backgrounding-occluded-windows',
  '--disable-ipc-flooding-protection',
  '--no-zygote',
  '--process-per-site',
  '--max_old_space_size=256',
  '--memory-pressure-off',
  '--autoplay-policy=no-user-gesture-required',
  '--use-fake-ui-for-media-stream',
  '--window-size=540,960'
]
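A note on the list above, offered as a sketch rather than a confirmed fix: `max-old-space-size` is a V8 option, and as far as I know Chrome forwards V8 options through its `--js-flags` switch, so the bare `--max_old_space_size=256` entry may have no effect on the browser. The helper name `buildChromeArgs` is hypothetical, and only a few of the flags are repeated here.

```typescript
// Hypothetical helper: build Chrome launch args with the V8 heap cap passed
// through --js-flags instead of as a top-level Chrome switch.
function buildChromeArgs(heapMb: number, width: number, height: number): string[] {
  return [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
    `--js-flags=--max-old-space-size=${heapMb}`,
    `--window-size=${width},${height}`,
  ];
}

const chromeArgs = buildChromeArgs(256, 540, 960);
console.log(chromeArgs.join(" "));
```

The resulting array would be passed as the `args` option of `puppeteer.launch`. Even with this flag, each Chrome renderer and GPU process has memory outside the V8 heap, so the cap limits JavaScript objects only.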

:white_check_mark: Process Management

  • Active watchdog script monitoring all processes
  • 15-minute worker timeout with forced restart
  • Automatic process restart on crashes
  • Hourly resource cleanup
  • Memory and disk usage monitoring
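Since the watchdog already restarts processes, it may help to record how each one died. A minimal sketch, assuming the supervision were done from Node.js (the original watchdog is `start.sh`) and using a hypothetical `classifyExit` function on the `(code, signal)` pair that Node's `child_process` `exit` event reports. A process killed by the kernel's out-of-memory killer exits with `SIGKILL`, although a forced restart can send the same signal.

```typescript
// Hypothetical classifier for the (code, signal) pair reported by the
// 'exit' event of a supervised child process.
type ExitKind = "clean" | "error" | "killed-possibly-oom" | "terminated";

function classifyExit(code: number | null, signal: string | null): ExitKind {
  if (signal === "SIGKILL") return "killed-possibly-oom"; // OOM killer or a forced kill
  if (signal !== null) return "terminated";               // e.g. SIGTERM from a restart
  return code === 0 ? "clean" : "error";
}

console.log("[WATCHDOG] worker exit:", classifyExit(null, "SIGKILL"));
```

In a shell watchdog the equivalent signal is an exit status of 137 (128 + 9) from a process the script did not kill itself.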

Current Monitoring Output

[WATCHDOG] === CPU Core Utilization Monitor ===
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
[WATCHDOG] Process distribution:
    14   0 next-server    next-server (v15.3.3)
    27   1 node           node --max-old-space-size=256 dist/worker.js
    31   3 node           node --max-old-space-size=128 dist/repair.worker.js

The Problem

Despite achieving 20% CPU usage (a massive improvement!), the space still crashes and becomes unusable. The processes are properly distributed across CPU cores and memory usage appears controlled.

Questions for HuggingFace Support

  1. What could cause crashes at 20% CPU usage? Are there other resource limits I’m hitting?

  2. Memory limits on free tier? What’s the actual RAM limit? My processes use 384MB + 256MB + 128MB = ~768MB total.

  3. Disk space limits? Could /tmp disk usage cause crashes? I’m cleaning up regularly.

  4. Network/API rate limiting? Could YouTube API calls or external requests trigger space suspension?

  5. Docker resource limits? Are there container limits I’m not aware of?

  6. Process limits? Am I running too many Node.js processes simultaneously?

  7. WebM video recording issues? Could MediaRecorder cause crashes even at low bitrates?
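On question 2, the ~768MB total counts only the Node.js heap caps and leaves out the Chrome processes that Puppeteer starts. A minimal sketch, assuming a Linux container with `/proc` mounted, that sums the resident memory of every visible process. The function names are hypothetical, and shared pages are counted once per process, so the total is an overestimate.

```typescript
import { readdirSync, readFileSync } from "node:fs";

// Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.
function parseVmRssKb(statusText: string): number {
  const match = statusText.match(/^VmRSS:\s+(\d+)\s+kB/m);
  return match ? Number(match[1]) : 0; // kernel threads have no VmRSS line
}

// Sum resident memory (in MB) over every process listed in /proc.
function totalRssMb(): number {
  let pids: string[] = [];
  try {
    pids = readdirSync("/proc").filter((name) => /^\d+$/.test(name));
  } catch {
    return 0; // /proc is not available (non-Linux host)
  }
  let totalKb = 0;
  for (const pid of pids) {
    try {
      totalKb += parseVmRssKb(readFileSync(`/proc/${pid}/status`, "utf8"));
    } catch {
      // The process exited between listing and reading; skip it.
    }
  }
  return Math.round(totalKb / 1024);
}

console.log(`[WATCHDOG] total RSS across processes: ${totalRssMb()}MB`);
```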

Code Repository Structure

├── start.sh (Watchdog script with multi-core affinity)
├── worker.ts (Puppeteer automation with optimized Chrome)
├── components/BattleArena.tsx (Canvas rendering + MediaRecorder)
├── lib/BroadcastUI.ts (Optimized UI rendering)
├── Dockerfile (Next.js Docker setup)
└── package.json (Dependencies)

What I’ve Tried

  • Reduced worker timeout from 30min to 15min
  • Added aggressive resource cleanup every hour
  • Implemented proper browser process cleanup
  • Added memory monitoring and garbage collection
  • Used taskset for CPU core affinity
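For the browser cleanup item above, one common pattern is to close the browser in a `finally` block so a failed cycle cannot leave Chrome processes behind. This is a minimal sketch, not the code in `worker.ts`: `withCleanup` is a hypothetical helper, and `Closable` stands in for any object with an async `close()` method, such as Puppeteer's `Browser`.

```typescript
// Minimal shape shared by Puppeteer's Browser and other closable resources.
interface Closable {
  close(): Promise<void>;
}

// Hypothetical helper: open a resource, run one automation cycle, and always
// close the resource afterwards, even if the cycle throws.
async function withCleanup<T extends Closable, R>(
  open: () => Promise<T>,
  cycle: (resource: T) => Promise<R>,
): Promise<R> {
  const resource = await open();
  try {
    return await cycle(resource);
  } finally {
    await resource.close();
  }
}
```

With Puppeteer this could be called as `withCleanup(() => puppeteer.launch({ args }), async (browser) => { /* one battle */ })`.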

Expected vs Actual

  • Expected: Stable operation at 20% CPU
  • Actual: Space crashes and becomes unusable despite low CPU

Request

Could someone from HuggingFace support help identify what resource limit I’m hitting? The CPU optimization worked perfectly, but something else is causing the crashes.

Is there a way to get detailed crash logs or resource usage reports to debug this further?
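One way to approximate such a report from inside the container is to read the cgroup memory files that Docker containers on Linux normally expose. A minimal sketch, assuming the Space makes these standard paths readable (not confirmed for Hugging Face specifically). The function names are hypothetical.

```typescript
import { readFileSync } from "node:fs";

// Parse a cgroup memory value: a byte count, or "max" meaning no limit (cgroup v2).
function parseCgroupBytes(raw: string): number | null {
  const text = raw.trim();
  if (text === "" || text === "max") return null;
  const value = Number(text);
  return Number.isFinite(value) ? value : null;
}

// Return the contents of the first readable file, or null if none can be read.
function readFirst(paths: string[]): string | null {
  for (const path of paths) {
    try {
      return readFileSync(path, "utf8");
    } catch {
      // Not present on this cgroup version; try the next path.
    }
  }
  return null;
}

// Read container memory usage and limit, trying cgroup v2 paths before v1.
function containerMemory(): { usedBytes: number | null; limitBytes: number | null } {
  const used = readFirst([
    "/sys/fs/cgroup/memory.current",
    "/sys/fs/cgroup/memory/memory.usage_in_bytes",
  ]);
  const limit = readFirst([
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
  ]);
  return {
    usedBytes: used === null ? null : parseCgroupBytes(used),
    limitBytes: limit === null ? null : parseCgroupBytes(limit),
  };
}

console.log("[WATCHDOG] container memory:", containerMemory());
```

Logging this every few seconds during an automated cycle would show whether usage climbs toward the limit just before a crash.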

Additional Info

  • Space worked fine during testing phases
  • Crashes occur during automated cycles
  • Manual operations work correctly
  • Problem persists after all optimizations

Thank you for any guidance! :folded_hands:


Tags: Spaces #docker #nextjs #puppeteer #automation #crashes #cpu-optimization #memory-limits

I’m not from support, but in my experience a Hugging Face Space usually crashes because it has run out of RAM rather than because of CPU overload. Answering by question number:

  1. Unless the app runs a very unusual command, I don’t think it will crash at 20% CPU usage.

  2. On the free tier, each Space gets 16GB of RAM plus 50GB of SSD for swapping.

  3. The 50GB limit applies to the Space as a whole.

  5. Some programs, such as reverse proxies, are prohibited, but in that case I think the build would fail rather than the Space crash.

RAM usage peaks at only about 2-3GB, and only occasionally, and I clean up storage after each upload.