How can I decode token by token, i.e. without the tokenizer removing the spaces before punctuation? In the example below, I would expect '[CLS] hello world . [SEP]', i.e. a space between "world" and ".".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
x = tokenizer.encode("Hello World.")
tokenizer.decode(x, clean_up_tokenization_spaces=False)
# '[CLS] hello world. [SEP]'
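
To make the expected output concrete, here is a minimal sketch of the joining behaviour I am after. It assumes the tokens are obtained separately, e.g. with tokenizer.convert_ids_to_tokens(x), and that they look like the hard-coded list below. The helper name join_wordpieces is hypothetical and not part of transformers.

```python
def join_wordpieces(tokens):
    """Join WordPiece tokens with single spaces, merging '##' continuation pieces."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: glue it onto the previous token without a space.
            words[-1] += tok[2:]
        else:
            # Regular token, special token, or punctuation: keep it as its own item.
            words.append(tok)
    return " ".join(words)


# Assumed token list for "Hello World." (hard-coded so the sketch runs without
# downloading a model).
tokens = ["[CLS]", "hello", "world", ".", "[SEP]"]
print(join_wordpieces(tokens))  # [CLS] hello world . [SEP]
```

Punctuation stays separated because nothing here applies the clean-up step that removes the space before ".". Ideally decode itself would offer a way to get this result.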