I tried to migrate from HfApiModel to TransformersModel because I do not wish to incur more cost, but I got this error. I am using ZeroGPU.
```python
model = TransformersModel(
    # model_id="Qwen/Qwen2.5-Coder-14B-Instruct",
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    device_map="cuda",
    max_new_tokens=5000,
    torch_dtype="bfloat16",
)
```
I tried to solve it on my own, but the error persists. What else should I try?
I think it’s the same kind of error as in past cases, which could be avoided by quantization; it’s interesting that it also occurs in float32.
One hypothesis I found is that the cause may be a failure to tokenize a special token.
From a related GitHub issue (opened 9 Nov 2023):
### Describe the issue
Issue: I pulled the latest commits from the repo and tried to run CLI inference; the network is producing NaN as probability outputs.
Command:
```
python -m llava.serve.cli --model-path liuhaotian/llava-v1.5-7b --image-file "https://llava-vl.github.io/static/images/view.jpg" --load-4bit
```
I also tried creating a fresh environment; still the same bug.
I explored the tokens fed into the network and noticed a strange -200 token. Not sure if this is causing the issue; maybe someone can have a look? I'm trying to debug it and will come back here if I have news!
Log:
```
Traceback (most recent call last):
File "/home/riccardoricci/miniconda3/envs/chat_osm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/riccardoricci/miniconda3/envs/chat_osm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/media/data/Riccardo/chat_with_OSM/LLaVA/llava/serve/cli.py", line 125, in <module>
main(args)
File "/media/data/Riccardo/chat_with_OSM/LLaVA/llava/serve/cli.py", line 95, in main
output_ids = model.generate(
File "/home/riccardoricci/miniconda3/envs/chat_osm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/riccardoricci/miniconda3/envs/chat_osm/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
return self.sample(
File "/home/riccardoricci/miniconda3/envs/chat_osm/lib/python3.10/site-packages/transformers/generation/utils.py", line 2678, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```

From another GitHub issue (opened 19 Jul 2023, labeled model-usage):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = ...
inputs = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
model.generate(**inputs, **generate_kwargs)
```
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I got this error while doing inference for text generation, in particular when the batch size is greater than 1. I did not get this error, and generation works correctly, when the batch size is set to 1.
Does anyone see the same issue?
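For reference, the quantization workaround mentioned above could look roughly like this (a minimal sketch using plain transformers and bitsandbytes; the 4-bit settings and the prompt are illustrative assumptions, not from this thread):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Load the weights in 4-bit so generation runs through a quantized path,
# which reportedly avoids the inf/nan probability error in some cases.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="cuda",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```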
Interesting. I now tried the unsloth Llama 3.2 bnb 4-bit model and it threw another error:
```
RuntimeError: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU:
[(torch.Size([1, 4718592]), device(type='cpu')), (torch.Size([147456]), device(type='cpu')), (torch.Size([3072, 3072]), device(type='cpu'))]
```
I'm not sure how to move the input tensors onto the GPU. I am using stream_to_gradio()
to send the new message to the agent.
Thank you for your help.
Edit: not sure if this is the right approach. I tried
```python
model = TransformersModel(
    # model_id="Qwen/Qwen2.5-Coder-14B-Instruct",
    model_id="unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit",
    device_map="cuda",
)
model.model = model.model.to("cuda")
```
or
```python
model = TransformersModel(
    # model_id="Qwen/Qwen2.5-Coder-14B-Instruct",
    model_id="unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit",
    device_map="cuda",
)
model.model.to("cuda")
```
Edit: neither worked.
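One way to narrow it down might be to load the same 4-bit checkpoint directly with transformers, outside smolagents and stream_to_gradio(), and check whether generation works and where the weights actually land. This is only a diagnostic sketch; it assumes bitsandbytes and accelerate are installed and that the pre-quantized checkpoint loads via AutoModelForCausalLM:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

# If this prints any CPU device, some modules were offloaded at load time,
# which would explain the "tensors not on a GPU" error.
print({p.device for p in model.parameters()})

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```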
In the smolagents source below, the prompt tensor is already properly moved with .to(self.model.device)…
I wonder if this is another bug.
```python
    )
else:
    prompt_tensor = self.tokenizer.apply_chat_template(
        messages,
        tools=[get_tool_json_schema(tool) for tool in tools_to_call_from] if tools_to_call_from else None,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True if tools_to_call_from else False,
    )
prompt_tensor = prompt_tensor.to(self.model.device)
count_prompt_tokens = prompt_tensor["input_ids"].shape[1]
if stop_sequences:
    stopping_criteria = self.make_stopping_criteria(
        stop_sequences, tokenizer=self.processor if hasattr(self, "processor") else self.tokenizer
    )
else:
    stopping_criteria = None
out = self.model.generate(
```
How about like this?
```python
model = TransformersModel(
    # model_id="Qwen/Qwen2.5-Coder-14B-Instruct",
    model_id="unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit",
    device_map="cuda",
)
print(model.model.device)
```
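And if the reported device looks right but the error persists, the per-module placement recorded by accelerate could also be worth a look (a small follow-up sketch; hf_device_map is only present when a device map was actually used, hence the getattr):
```python
from smolagents import TransformersModel

model = TransformersModel(
    model_id="unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit",
    device_map="cuda",
)

# hf_device_map is set by accelerate when a device map is used; any module
# mapped to "cpu" or "disk" here would explain the mixed-device RuntimeError.
print(getattr(model.model, "hf_device_map", None))
```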