I will check it out! (I was also told about it yesterday!) But how does it handle the edge cases where the <eos>
token might not be tokenized correctly, for example when it's next to a space? I was told anecdotally that this can be a severe issue.
I'd personally put an assert on the tokenization inside ds.map, asserting that the eos_id does appear in the "right" place.
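
A minimal sketch of that kind of assert might look like the following. It assumes the "right" place means the eos id is the last token and appears exactly once, and that examples live in a "text" column; `check_eos`, `make_map_fn`, and `toy_encode` are hypothetical names, and `toy_encode` is only a stand-in for a real tokenizer so the snippet runs on its own.

```python
from typing import Callable, Dict, List


def check_eos(input_ids: List[int], eos_id: int) -> None:
    """Assert that eos_id is the final token and appears exactly once."""
    assert input_ids, "empty token sequence"
    assert input_ids[-1] == eos_id, (
        f"last token {input_ids[-1]} is not eos_id {eos_id}"
    )
    assert input_ids.count(eos_id) == 1, "eos_id appears more than once"


def make_map_fn(
    encode: Callable[[str], List[int]],
    eos_id: int,
    eos_text: str = "<eos>",
) -> Callable[[Dict[str, List[str]]], Dict[str, List[List[int]]]]:
    """Build a batched map function that tokenizes and checks every example."""

    def map_fn(batch: Dict[str, List[str]]) -> Dict[str, List[List[int]]]:
        all_ids = []
        for text in batch["text"]:
            ids = encode(text + eos_text)
            check_eos(ids, eos_id)  # fail loudly instead of training on bad data
            all_ids.append(ids)
        return {"input_ids": all_ids}

    return map_fn


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer (assumption, not a real one): splits on whitespace
    and only recognizes "<eos>" when it is a separate piece."""
    return [0 if piece == "<eos>" else 1 + len(piece) for piece in text.split()]


if __name__ == "__main__":
    good = make_map_fn(toy_encode, eos_id=0, eos_text=" <eos>")
    print(good({"text": ["hello world"]}))

    bad = make_map_fn(toy_encode, eos_id=0, eos_text="<eos>")
    try:
        bad({"text": ["hello world"]})
    except AssertionError as err:
        print("caught:", err)
```

With a real setup, the idea would be to pass something like `lambda s: tokenizer(s)["input_ids"]` as `encode` and `tokenizer.eos_token_id` as `eos_id`, then call `ds.map(map_fn, batched=True)`, so any example where the eos token gets split or merged stops the preprocessing right away.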