What does the parameter 'clean_up_tokenization_spaces' do in the tokenizer.decode function?

ivnle · May 2, 2022, 2:28am

As titled. The documentation is a bit vague.

Whether or not to clean up the tokenization spaces.

morenolq · December 5, 2022, 6:03pm

It should remove space artifacts inserted while encoding the sequence. E.g., if you have state-of-the-art it will be encoded as state - of - the - art. The cleanup should remove those spaces between -. Hope it helps!

gspeter0 · July 8, 2025, 9:45am

But may be ‘-’ these like tokens is important for response or output

You sure about that

Topic		Replies	Views
How to avoid PreTrainedTokenizerFast.decode to add space between tokens 🤗Transformers	3	43	April 22, 2025
How to decode with spaces? 🤗Tokenizers	0	1864	April 28, 2022
How to make tokenizer add the spaces correctly when decoding a sequence when set add_prefix_space=False 🤗Tokenizers	0	568	October 9, 2023
Tokenizer: what function removes spaces between '<' and '>'? 🤗Tokenizers	0	51	December 9, 2024
BPEDecoder no spaces after special tokens Intermediate	4	2045	April 19, 2023

What does the parameter 'clean_up_tokenization_spaces' do in the tokenizer.decode function?

Related topics