Here is a detokenized sequence:
print(tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=False, clean_up_tokenization_spaces=False))
<tool_call>
{"arguments": {"symbol": "AAPL"}, "name": "get_stock_fundamentals"}
</tool_call><|im_end|>
As you can see, </tool_call> is reconstructed without any spaces between the pieces. The original sequence of token IDs is
700 6462 28730 2845 28767, which corresponds to the tokens
"</" "tool" "_" "call" ">"
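For reference, here is a minimal sketch of how I inspect the individual tokens before decoding joins them (the checkpoint name is a placeholder for whichever SentencePiece-based model is loaded, and the printed token strings are only indicative):

from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the tokenizer actually being used.
tokenizer = AutoTokenizer.from_pretrained("your-model-checkpoint")

ids = [700, 6462, 28730, 2845, 28767]
tokens = tokenizer.convert_ids_to_tokens(ids)            # per-token strings, e.g. ['</', 'tool', '_', 'call', '>']
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))        # joined back into a single string, e.g. '</tool_call>'
print(tokenizer.decode(ids, skip_special_tokens=False))  # same result via decode()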
Which Transformers function implements the logic that removes the spaces? How does it know that 'tool', '_', and 'call' are parts of a single keyword?
Would appreciate your guidance.