Strange sequence generation with distilbart-xsum (clumped first token)

Hi,
I am using `run_summarization.py` without any modifications. I use `sshleifer/distilbart-xsum-12-6` as the base model and fine-tune it on the natural questions short-answer dataset from HF. However, when I run generation on a test set, I see very strange behavior. Here is an example:

| predicted question | original question |
| --- | --- |
| isis it easy for 5 year olds to play? | is this fun for 2 players? |
| doesdoes it work with windows 8.1? | can this be used to seal/finish a canvas acrylic painting instead of gloss varnish? |
| isdoes this the right size for an 18 month old? | is this worth the money? |
| whathow many and what size batteries are needed or included? | what type of batteries and how many are required for both handsets? |

As you can see, the first token is two question words clumped together, and the second word of the pair appears to be the correct one.
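As a stopgap while I debug the root cause, I can clean the outputs with a post-processing pass that drops the fused first word. This is only a sketch: the `QUESTION_WORDS` list and the function name are illustrative, not part of any library.

```python
# Illustrative list of question/auxiliary words seen at the start of predictions.
QUESTION_WORDS = ["what", "how", "is", "does", "can", "do", "are",
                  "will", "who", "when", "where", "why", "which"]

def split_clumped_prefix(text):
    """If the first token looks like two fused question words
    (e.g. 'isis', 'whathow', 'isdoes'), keep only the second one."""
    first, sep, rest = text.partition(" ")
    for w1 in QUESTION_WORDS:
        remainder = first[len(w1):]
        if first.startswith(w1) and remainder in QUESTION_WORDS:
            return remainder + sep + rest
    return text  # no fused prefix detected; leave unchanged

print(split_clumped_prefix("isis it easy for 5 year olds to play?"))
# -> is it easy for 5 year olds to play?
print(split_clumped_prefix("whathow many and what size batteries are needed or included?"))
# -> how many and what size batteries are needed or included?
```

Of course this only hides the symptom, so I'd still like to understand why the model produces these tokens in the first place.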

Also, I have fine-tuned this model on other datasets and have never seen this behavior with any of them, only with the one I mentioned. I have looked at `vocab.json`, and none of these clumped tokens are in the vocab, so it seems they are OOV generations? I have no idea how to debug this and figure out what is happening, but I don't think the fine-tuning code itself is buggy, since it's straight from the HF master branch examples and works fine for other datasets.

Any suggestions or ideas for what I can try?

Thanks in advance.