I’ve run into two intermittent problems with my Space that I’m still investigating:
- Sometimes I get weird errors that might be triggered by low memory, such as:
===== Application Startup at 2025-02-10 01:58:53 =====
Starting!
total used free shared buff/cache available
Mem: 126777 116895 25086 2 21430 9882
Swap: 905990 148586 757403
Downloading models now
Loaded base model
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
Some parameters are on the meta device because they were offloaded to the disk and cpu.
Loaded peft model
Quantizing the model
Traceback (most recent call last):
File "/home/user/app/main.py", line 64, in <module>
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 564, in quantize_dynamic
convert(model, mapping, inplace=True)
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 659, in convert
_convert(
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 716, in _convert
_convert(
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 716, in _convert
_convert(
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 716, in _convert
_convert(
[Previous line repeated 1 more time]
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 724, in _convert
reassign[name] = swap_module(
^^^^^^^^^^^^
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/quantize.py", line 766, in swap_module
new_mod = qmod.from_float(
^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 138, in from_float
qweight = _quantize_weight(mod.weight.float(), weight_observer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/nn/quantized/modules/utils.py", line 70, in _quantize_weight
wt_scale, wt_zp = observer.calculate_qparams()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/observer.py", line 546, in calculate_qparams
return self._calculate_qparams(self.min_val, self.max_val)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/observer.py", line 343, in _calculate_qparams
if not check_min_max_valid(min_val, max_val):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/ao/quantization/utils.py", line 406, in check_min_max_valid
if min_val == float("inf") and max_val == float("-inf"):
File "/home/user/miniconda3/envs/idioms/lib/python3.11/site-packages/torch/_meta_registrations.py", line 6471, in meta_local_scalar_dense
raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors
Other times I don’t get the message about parameters being offloaded to disk (see the sketch after this list for a possible guard).
- Sometimes the time to run a prediction steadily increases. For instance, one run took 93s for the first query, 149s for the second, and 210s for the third. When everything is working “normally”, each query takes about 60-90s and the time does not increase.
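For what it’s worth, the crash in the traceback seems to happen because some weights are still on the meta device (the “offloaded to the disk and cpu” case), so the weight observer inside quantize_dynamic can’t read them. As a stopgap I’m considering something like the sketch below; quantize_if_materialized is just a name I made up, and it assumes the failure only occurs when offloading kicked in:

```python
import torch


def quantize_if_materialized(model: torch.nn.Module) -> torch.nn.Module:
    """Run dynamic quantization only if no weights are on the meta device.

    When layers get offloaded to disk/CPU, their parameters stay on the meta
    device, and quantize_dynamic then dies with
    "Tensor.item() cannot be called on meta tensors".
    """
    if any(p.is_meta for p in model.parameters()):
        print("Skipping dynamic quantization: some weights are offloaded")
        return model
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```

That only avoids the crash, of course; the model stays unquantized on those runs, so it’s a workaround rather than a fix.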
I’ve started printing the output of free -m, and I’ve noticed that the amount of available memory varies quite a bit, which I think might explain problem #1, but I’m unsure about #2. I’m still investigating, but I figured I would post to see if anyone has encountered similar behavior before.
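In case it helps anyone reproduce this, here is roughly what I’ve added for the logging (a minimal sketch; run_query stands in for whatever actually serves a prediction in my app):

```python
import subprocess
import time


def log_free_memory(tag: str) -> None:
    """Print the same `free -m` snapshot that shows up in the startup log."""
    result = subprocess.run(["free", "-m"], capture_output=True, text=True)
    print(f"--- free -m ({tag}) ---")
    print(result.stdout)


def timed_query(run_query, *args, **kwargs):
    """Wrap a prediction call so the per-query latency trend is visible."""
    log_free_memory("before query")
    start = time.monotonic()
    result = run_query(*args, **kwargs)
    print(f"query took {time.monotonic() - start:.1f}s")
    log_free_memory("after query")
    return result
```

If the per-query times creep up while available memory keeps dropping, that would at least tie problem #2 to the same memory pressure as #1.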