Here is a detokenized sequence:
print(tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=False, clean_up_tokenization_spaces=False))
<tool_call>
{"arguments": {"symbol": "AAPL"}, "name": "get_stock_fundamentals"}
</tool_call><|im_end|>
As you can see, </tool_call> is reconstructed without any spaces between the pieces. The original sequence of token IDs is
700 6462 28730 2845 28767, which corresponds to the tokens
"</" "tool" "_" "call" ">"
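For reference, here is a minimal sketch of how I inspect the individual tokens before decoding joins them (the checkpoint name is a placeholder for whichever SentencePiece-based model is loaded, and the printed token strings are only indicative):

from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the tokenizer actually being used.
tokenizer = AutoTokenizer.from_pretrained("your-model-checkpoint")

ids = [700, 6462, 28730, 2845, 28767]
tokens = tokenizer.convert_ids_to_tokens(ids)            # per-token strings, e.g. ['</', 'tool', '_', 'call', '>']
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))        # joined back into a single string, e.g. '</tool_call>'
print(tokenizer.decode(ids, skip_special_tokens=False))  # same result via decode()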
Which Transformers function implements the logic that removes the spaces? How does it know that 'tool', '_', and 'call' are parts of a single keyword?
Would appreciate your guidance.