How to properly tokenize and pack sequences with EOS token handling for GPT-2 fine-tuning in Hugging Face Transformers?

I will check it out! (and was told about it yesterday!) But how does it handle the edge cases where the <eos> might not be tokenized correctly when it's next to a space, for example? I was anecdotally told this can be a severe issue.
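
One quick way to see whether this bites you is to tokenize a few variants directly and check whether the EOS id survives. A minimal sketch for GPT-2's tokenizer (the example strings are just made up for illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
eos = tok.eos_token        # "<|endoftext|>" for GPT-2
eos_id = tok.eos_token_id  # 50256

# Compare EOS glued to the text, separated by a space, and mid-string.
for text in [f"Hello world{eos}", f"Hello world {eos}", f"Hello world{eos} Next doc"]:
    ids = tok(text)["input_ids"]
    print(repr(text), "->", ids, "| eos id present:", eos_id in ids)
```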

I'd personally add an assert to the tokenization function passed to ds.map, checking that the eos_id actually appears in the "right" place.
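
Something along these lines, assuming a plain text column and GPT-2's tokenizer (the dataset name and column are only placeholders for the example):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
eos_id = tok.eos_token_id

def tokenize(batch):
    # Append EOS to every document before tokenizing.
    out = tok([t + tok.eos_token for t in batch["text"]])
    # Sanity check: each example must end with the real EOS id;
    # if the EOS string got split/merged, this fails loudly.
    for ids in out["input_ids"]:
        assert ids and ids[-1] == eos_id, f"EOS not tokenized correctly: {ids[-10:]}"
    return out

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
```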