Strange sequence generation with distilbart-xsum (clumped tokens)

I am using the fine-tuning code from the HF examples without any modifications. I use sshleifer/distilbart-xsum-12-6 as the base model and fine-tune it on the natural questions short answer dataset from HF. However, when I run generation on an input test set, I see very strange behavior. Here is an example:

predicted question | original question
isis it easy for 5 year olds to play? | is this fun for 2 players?
doesdoes it work with windows 8.1? | can this be used to seal/finish a canvas acrylic painting instead of gloss varnish?
isdoes this the right size for an 18 month old? | is this worth the money?
whathow many and what size batteries are needed or included? | what type of batteries and how many are required for both handsets?

As you can see, the first token is a clumped version of two words, where the second word seems to be the right one.

Also, I have fine-tuned this model on other datasets and have never seen this behavior with any of them, only with the one I mentioned. I have looked at vocab.json and none of these clumped words are in the vocab, so it seems they are OOV generations? I have no idea how to debug this and figure out what is happening, but I don’t think the fine-tuning code is buggy, since it’s straight from the HF master branch examples and it works fine for other datasets.
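One thing that may be worth ruling out: a word that is missing from vocab.json is not necessarily an OOV token. BART uses a GPT-2 style byte-level BPE, where a token that starts a new word carries a leading space marker (shown as "Ġ" in vocab.json), so two ordinary vocab tokens emitted back to back without that marker decode into one fused word. The snippet below is only a toy sketch of that decoding step, not the real tokenizer; `toy_decode` and both token sequences are hypothetical and made up for illustration.

```python
# Toy sketch of byte-level BPE detokenization (GPT-2 / BART style).
# Assumption: a leading "Ġ" on a token stands for a preceding space.
SPACE_MARKER = "Ġ"


def toy_decode(tokens):
    """Concatenate subword tokens and turn the space marker into a real space."""
    return "".join(tokens).replace(SPACE_MARKER, " ").strip()


# Hypothetical token sequences, invented for illustration only.
clumped = ["is", "is", "Ġit", "Ġeasy", "?"]  # second "is" has no space marker
clean = ["is", "Ġit", "Ġeasy", "?"]

print(toy_decode(clumped))  # isis it easy?
print(toy_decode(clean))    # is it easy?
```

If this is what is going on, calling `tokenizer.convert_ids_to_tokens` on the generated ids (instead of only looking at the decoded string) should show whether "isis" is really one token or two separate vocab tokens at the start of the sequence.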

Any suggestions or ideas for what I can try?

Thanks in advance.