What does the parameter 'clean_up_tokenization_spaces' do in the tokenizer.decode function?

As titled. The documentation is a bit vague; all it says is:

"Whether or not to clean up the tokenization spaces."

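It removes space artifacts that tokenization introduces and that are still there when the tokens are joined back into a string on decoding. E.g., if you have state-of-the-art, it can come back as state - of - the - art, and the cleanup is meant to remove that kind of stray space. As far as I know, the built-in rules mainly target spaces before punctuation and contractions (such as " ." or " n't"), so hyphenated words may not always be restored. Hope it helps!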
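To make the effect concrete, here is a minimal sketch of the kind of string replacement this cleanup performs. The helper name `clean_up_spaces_sketch` is hypothetical, and the replacement list is an assumption modeled on how the `transformers` cleanup behaves as far as I know, so check your installed version for the exact rules. Under these assumed rules, spaces before punctuation and contractions are removed, while the spaces around hyphens are left alone.

```python
# Hypothetical stand-in for the library's cleanup step, NOT its actual API.
# The (old, new) pairs below are an assumption about the rules it applies.
_REPLACEMENTS = [
    (" .", "."),
    (" ?", "?"),
    (" !", "!"),
    (" ,", ","),
    (" ' ", "'"),
    (" n't", "n't"),
    (" 'm", "'m"),
    (" 's", "'s"),
    (" 've", "'ve"),
    (" 're", "'re"),
]


def clean_up_spaces_sketch(text: str) -> str:
    """Remove the space artifacts listed in _REPLACEMENTS from decoded text."""
    for old, new in _REPLACEMENTS:
        text = text.replace(old, new)
    return text


# A string as it might look after joining decoded tokens with spaces.
raw = "hello , world ! it 's state - of - the - art ."
cleaned = clean_up_spaces_sketch(raw)
print(cleaned)  # hello, world! it's state - of - the - art.
```

With a real tokenizer you would not call a helper like this yourself; you would pass `clean_up_tokenization_spaces=True` or `False` to `tokenizer.decode` and compare the two outputs on your own text.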
