The documentation for GPT2Tokenizer suggests keeping the default of not adding a space before words (`add_prefix_space=False`).
I understand that GPT2 was trained without a space added at the start of sentences, which means the same word tokenizes differently depending on whether it is preceded by a space.
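For illustration, here is a minimal sketch of that difference using the transformers library (in the GPT2 vocabulary, `Ġ` marks "preceded by a space"):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# The same word maps to a different token depending on a leading space;
# 'Ġ' in the vocabulary marks "word preceded by a space".
print(tokenizer.tokenize("Hello world"))   # ['Hello', 'Ġworld']
print(tokenizer.tokenize(" Hello world"))  # ['ĠHello', 'Ġworld']

# The token IDs differ too, so the model sees genuinely different inputs:
print(tokenizer.encode("Hello") == tokenizer.encode(" Hello"))  # False
```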
However, I imagine that most of the training text looked like concatenated documents, something like:

`document_1<|endoftext|>document_2<|endoftext|>…`

where document_n could be:

> This is a long article from wikipedia. Lots of sentences.
So most of the time, a new sentence would actually start with a space (separating it from the previous sentence) or a line break. I'm not aware of any extra preprocessing that would remove spaces after punctuation (is there any?).
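A quick check of that intuition (just a sketch): tokenizing a two-sentence string shows the second sentence's first word picking up the space-prefixed token:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Mid-document, a sentence that follows ". " starts with a Ġ-prefixed token:
print(tokenizer.tokenize("Hello world. Hello world."))
# ['Hello', 'Ġworld', '.', 'ĠHello', 'Ġworld', '.']
```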
In that case, it is not obvious what the best strategy is when fine-tuning (adding spaces before words or not), since we may want to replicate whatever was most common in the original training data.
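For concreteness, the knob in transformers that controls this is `add_prefix_space` (default `False`); a minimal sketch of the two fine-tuning strategies:

```python
from transformers import GPT2Tokenizer

# Strategy 1 (default): no space added, first word uses the space-less token.
tok_default = GPT2Tokenizer.from_pretrained("gpt2")

# Strategy 2: behave as if every input started mid-document, i.e. after a space.
tok_prefix = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)

print(tok_default.tokenize("Hello world"))  # ['Hello', 'Ġworld']
print(tok_prefix.tokenize("Hello world"))   # ['ĠHello', 'Ġworld']
```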
I would love any comments!