Hi,
I am using run_summarization.py without any modifications. I use sshleifer/distilbart-xsum-12-6 as the base model and fine-tune it on the Natural Questions short-answer dataset from HF. However, when I run generation on a test set, I see very strange behavior. Here is an example:
| predicted question | original question |
|---|---|
| isis it easy for 5 year olds to play? | is this fun for 2 players? |
| doesdoes it work with windows 8.1? | can this be used to seal/finish a canvas acrylic painting instead of gloss varnish? |
| isdoes this the right size for an 18 month old? | is this worth the money? |
| whathow many and what size batteries are needed or included? | what type of batteries and how many are required for both handsets? |
As you can see, the first token of each prediction is two words clumped together, and the second of the two words seems to be the right one.
Also, I have fine-tuned this model on other datasets and never seen this behavior with any of them, only with the one I mentioned. I have looked at vocab.json and none of these fused tokens are in the vocab, so they seem to be OOV generations. I have no idea how to debug this or figure out what is happening, but I don't think the fine-tuning code itself is buggy, since it's straight from the HF master-branch examples and works fine for other datasets.
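For what it's worth, every garbled prefix in my outputs looks like two question-opening words concatenated ("is"+"is", "is"+"does", "what"+"how"). Here is a throwaway helper I wrote to confirm that pattern (this is my own diagnostic, not part of run_summarization.py; the question-word list is just a guess):

```python
# Throwaway diagnostic: check whether the garbled first token of a
# prediction is two question-opening words fused together.
# QUESTION_WORDS is my own guess at likely sentence starters.
QUESTION_WORDS = ["is", "does", "can", "what", "how", "who",
                  "where", "when", "why", "which", "will"]

def split_fused_prefix(text):
    """Return (w1, w2) if the first whitespace-separated token of `text`
    is a concatenation of two question words, else None."""
    first = text.split()[0]
    for w1 in QUESTION_WORDS:
        if first.startswith(w1):
            rest = first[len(w1):]
            if rest in QUESTION_WORDS:
                return (w1, rest)
    return None

predictions = [
    "isis it easy for 5 year olds to play?",
    "doesdoes it work with windows 8.1?",
    "isdoes this the right size for an 18 month old?",
    "whathow many and what size batteries are needed or included?",
]
for pred in predictions:
    print(pred.split()[0], "->", split_fused_prefix(pred))
```

Every one of my bad outputs matches, which makes me suspect the duplication happens at the first decoding step rather than being random OOV noise.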
Any suggestions or ideas of what I can try?
Thanks in advance.