As titled. The documentation is a bit vague.
Whether or not to clean up the tokenization spaces.
It should remove space artifacts inserted while encoding the sequence. E.g., if you have "state-of-the-art", it will be encoded as "state - of - the - art". The cleanup should remove those spaces around each "-". Hope it helps!
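
To make the idea concrete, here is a rough sketch of what such a cleanup step could look like. This is only an illustration and not the library's actual implementation: `clean_up_spaces` is a hypothetical helper, and the exact set of patterns the real option handles may differ from the two rules assumed here.

```python
import re


def clean_up_spaces(text: str) -> str:
    """Hypothetical helper (not a library function): remove space artifacts
    that tokenization may leave in a decoded string."""
    # Assumed rule 1: collapse spaces around hyphens,
    # e.g. "state - of - the - art" becomes "state-of-the-art".
    text = re.sub(r"\s+-\s+", "-", text)
    # Assumed rule 2: drop spaces before common punctuation,
    # e.g. "really ." becomes "really.".
    text = re.sub(r"\s+([.,!?])", r"\1", text)
    return text


decoded = "This is state - of - the - art , really ."
print(clean_up_spaces(decoded))  # This is state-of-the-art, really.
```

Note that a simple rule like this would also join a hyphen that was meant to stand alone as a separator (for example "a - b"), which is one reason such cleanup is usually exposed as an on/off option.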