I can't understand why generative models make repetitions

The screenshot shows the BLOOM output for the prompt “Please unscramble the letters into a word, and write that word:\nr e!c.i p r o.c a/l =”. (Bloom Book - a Hugging Face Space by bigscience, 2022-06-16, the first sentence)

I think it makes sense that the model generates ‘The word is “RECIPROCAL”.\nThe’ after the prompt.

But what I can’t understand is why the token ‘word’ is the best choice for the next token after “Please unscramble the letters into a word, and write that word:\nr e!c.i p r o.c a/l = r e!c.i p r o.c a/l\nThe word is “RECIPROCAL”.\nThe”

This seems unreasonable because, in the training dataset, I believe this kind of repetition is rare. Obviously, sequences continuing past the second ‘word’ barely occur in the training data.

So, my question is: even if repetition of tokens (words or sentences) is rare in the training data, why does the model act as if repetition is the best choice?

This is a very common issue in NLG; repetition and hallucination are two of the biggest problems in the field. These problems are not “solved” and are likely inherent to the probabilistic nature of language models. There is a lot of related work on this, so if you search for “natural language generation repetition” you’ll come across plenty of literature. Here is a start.



Thanks for the reply!

I’ve read a few papers, but I am still curious about token repetition.

As mentioned in the nucleus sampling paper, the probability of a repeated phrase increases with each repetition.

It is easy to see this behavior in many generation models if you check the probabilities of the repeated tokens.
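The self-reinforcement can be illustrated with a toy analogy (this is not how a transformer actually computes probabilities, just a sketch): if next-token probabilities were estimated from bigram counts in the context itself, a cache-style model, then every repetition of a phrase adds in-context evidence for it, so its conditional probability grows.

```python
from collections import Counter, defaultdict

def bigram_prob(tokens, prev, nxt):
    """P(nxt | prev) estimated from bigram counts in the context so far
    (a toy 'cache' model: the context itself is the training data)."""
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

phrase = ["the", "word", "is", "reciprocal"]
ctx = ["unscramble", "the", "letters", "into", "a", "word"] + phrase
p1 = bigram_prob(ctx, "the", "word")            # phrase seen once:  0.5
ctx2 = ctx + phrase                             # phrase repeated
p2 = bigram_prob(ctx2, "the", "word")           # probability rises: ~0.67
print(p1, p2)
```

Under this toy model `p2 > p1`, mirroring the qualitative observation in the nucleus sampling paper that each repetition makes the next repetition more likely.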

What mechanisms help generation models provide positive feedback to a repeated phrase and assign such high probabilities to some of the input tokens?
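One candidate mechanism proposed in the interpretability literature is the “induction head”: an attention pattern that, when predicting after token t, looks back for an earlier occurrence of t and copies the token that followed it. The sketch below is a hypothetical, hand-written version of that behavior, not the actual transformer computation, but it shows exactly why ‘word’ is favored after the second ‘The’ in your example.

```python
def induction_copy(tokens):
    """Toy 'induction head': given the current (last) token, find its most
    recent earlier occurrence in the context and propose the token that
    followed it there. A sketch of one hypothesized copying mechanism."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == cur:
            return tokens[i + 1]
    return None  # current token has not appeared before

ctx = ["The", "word", "is", "RECIPROCAL", ".", "The"]
print(induction_copy(ctx))  # -> 'word'
```

A mechanism like this is useful on average (in-context copying helps with names, code, quoted strings), but during generation it feeds on the model’s own output, so each emitted repetition strengthens the case for the next one.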