CUDA-related error and factory reboot not working

This Space recently stopped working properly. I tried a factory reboot, but it didn't help.

The log shows the following error:

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/gradio/routes.py", line 247, in run_predict
    output = await app.blocks.process_api(
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/gradio/blocks.py", line 641, in process_api
    predictions, duration = await self.call_function(fn_index, processed_input)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/gradio/blocks.py", line 556, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/app/model.py", line 1241, in run_with_translation
    frames = self.run(text, seed, only_first_stage,image_prompt)
  File "/home/user/app/model.py", line 1178, in run
    set_random_seed(seed)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 429, in set_random_seed
    torch.manual_seed(seed)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/random.py", line 113, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 156, in _lazy_call
    callable()
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/random.py", line 111, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
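
As the last lines of the log note, the traceback ends in torch.manual_seed, but the assert may have fired in an earlier kernel. A minimal sketch of the suggested CUDA_LAUNCH_BLOCKING=1 setting follows, assuming it is placed at the very top of the Space's entry script before torch is imported. The helper name launch_blocking_enabled is hypothetical and only there to confirm the setting.

```python
import os

# Set this before torch is imported (and before any CUDA call), so that
# kernel launches run synchronously and the traceback points at the
# operation that actually failed rather than a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


# The rest of the app (import torch, model loading, Gradio UI) would follow here.
print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

This slows execution down, so it is only meant for debugging and should be removed once the failing operation is found.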

We haven’t changed the code in two weeks, and the Space was working fine until a few days ago, although we occasionally had to reboot it because of CUDA OOM errors (see this discussion). The Space also works fine in my GCP environment when I clone and run it there.

How can I fix this?

Currently investigating.

Edit: Fixed
