Inconsistent behaviors with free tier

I’ve run into a few intermittent problems with my space that I’m still investigating.

The two intermittent problems are:

  1. Sometimes I will get weird errors that might be triggered by low memory, such as:
===== Application Startup at 2025-02-10 01:58:53 =====

Starting!
               total        used        free      shared  buff/cache   available
Mem:          126777      116895       25086           2       21430        9882
Swap:         905990      148586      757403

Downloading models now
Loaded base model
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
Some parameters are on the meta device because they were offloaded to the disk and cpu.
Loaded peft model
Quantizing the model
Traceback (most recent call last):
  File "/home/user/app/main.py", line 64, in <module>
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 564, in quantize_dynamic
    convert(model, mapping, inplace=True)
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 659, in convert
    _convert(
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 716, in _convert
    _convert(
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 716, in _convert
    _convert(
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 716, in _convert
    _convert(
  [Previous line repeated 1 more time]
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 724, in _convert
    reassign[name] = swap_module(
                     ^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 766, in swap_module
    new_mod = qmod.from_float(
              ^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 138, in from_float
    qweight = _quantize_weight(mod.weight.float(), weight_observer)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/nn/quantized/modules/utils.py", line 70, in _quantize_weight
    wt_scale, wt_zp = observer.calculate_qparams()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/observer.py", line 546, in calculate_qparams
    return self._calculate_qparams(self.min_val, self.max_val)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/observer.py", line 343, in _calculate_qparams
    if not check_min_max_valid(min_val, max_val):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/utils.py", line 406, in check_min_max_valid
    if min_val == float("inf") and max_val == float("-inf"):
  File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/_meta_registrations.py", line 6471, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors

Other times I don’t get the message about model parameters being offloaded to the disk and CPU.

  2. Sometimes the time to run a prediction steadily increases. For instance, one time it took 93s for the first query, 149s for the next, followed by 210s. “Normally,” when everything is working, each query takes about 60–90s and does not increase over time.

I’ve started printing the results of free -m, and I’ve noticed that the amount of available memory varies quite a bit, which I think might explain problem #1. I’m less sure about #2. I’m still investigating, but figured I would post to see if anyone has encountered similar behavior before.
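A minimal sketch of a possible guard (assuming `model` is the PEFT model loaded in main.py; the helper names here are hypothetical): skip dynamic quantization when any parameters were offloaded to the meta device, and log free -m alongside it so memory availability can be correlated with the failures.

```python
import shutil
import subprocess

import torch


def log_memory():
    # Print `free -m` so memory availability shows up in the Space logs
    # right before the quantization step.
    if shutil.which("free"):
        subprocess.run(["free", "-m"], check=False)


def maybe_quantize(model):
    # accelerate leaves offloaded weights on the "meta" device; dynamic
    # quantization then fails with "Tensor.item() cannot be called on
    # meta tensors", so skip it in that case instead of crashing.
    offloaded = [n for n, p in model.named_parameters() if p.device.type == "meta"]
    if offloaded:
        print(f"Skipping quantization: {len(offloaded)} parameters are on the meta device")
        return model
    return torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```

With a guard like this, a low-memory startup would still come up (slower and unquantized) instead of crashing with the RuntimeError above.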


Sometimes. Each client instance is independent, but there is only one set of hardware per Space, so it is possible that someone else is using that Space at the same time. On GPU Spaces, if there is spare VRAM, parallel execution doesn’t slow things down much, but on CPU Spaces that is often not the case.

On the Gradio side, there are ways to limit how many requests can run simultaneously, or you can make the Space private.
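For example, a rough sketch assuming Gradio 4.x (`predict` stands in for the Space’s real inference function): a per-event concurrency limit plus a bounded queue keeps parallel requests from competing for the CPU.

```python
import gradio as gr


def predict(prompt):
    # Placeholder for the Space's actual model inference.
    return f"echo: {prompt}"


with gr.Blocks() as demo:
    inp = gr.Textbox(label="Prompt")
    out = gr.Textbox(label="Output")
    btn = gr.Button("Run")
    # At most one prediction runs at a time; extra requests wait in the queue.
    btn.click(predict, inputs=inp, outputs=out, concurrency_limit=1)

demo.queue(max_size=8)  # bound how many requests may wait
demo.launch()
```

Making the Space private keeps other users from sending requests to it at all.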

I have Gradio’s queue enabled with a concurrency of one, so that is not the problem.
