How to decode with spaces?

timbmg · April 28, 2022, 1:39pm

How can I decode token by token, i.e. without the tokenizer removing the spaces before punctuation? In the example below, I would expect '[CLS] hello world . [SEP]', i.e. a space between "world" and ".".

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
x = tokenizer.encode("Hello World.")
tokenizer.decode(x, clean_up_tokenization_spaces=False)
# '[CLS] hello world. [SEP]'
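
One possible workaround, offered as a sketch rather than an official answer: skip decode entirely, map the ids back to token strings with convert_ids_to_tokens, and join them with a single space yourself. The helper name decode_with_spaces is hypothetical and not part of transformers. It assumes only that the tokenizer object exposes convert_ids_to_tokens, which the transformers tokenizers do.

```python
def decode_with_spaces(tokenizer, ids):
    """Join the tokens for `ids` with single spaces, one token per id.

    `decode_with_spaces` is a hypothetical helper, not a transformers API.
    It assumes `tokenizer` has a `convert_ids_to_tokens` method that maps
    a list of ids to a list of token strings.
    """
    tokens = tokenizer.convert_ids_to_tokens(list(ids))
    # No detokenization is applied: punctuation stays separated, and
    # WordPiece continuation markers such as "##ing" are left untouched.
    return " ".join(tokens)
```

With the tokenizer and x from the question, decode_with_spaces(tokenizer, x) should give '[CLS] hello world . [SEP]'. Note that words split into several word pieces will also come back separated (with their "##" prefixes), since nothing is merged.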
