This HuggingFace blog article is a very useful introduction to configuring the various model.generate() methods for generating text. Most of these models accept a no_repeat_ngram_size
parameter which specifies that the generated text may not contain any n-gram of that size more than once. This removes the problem of generative models repeating large swathes of document text.
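To make the mechanism concrete, here is a minimal pure-Python sketch of the idea behind n-gram blocking: given the tokens generated so far, find every token that would complete an n-gram we have already emitted. (This is an illustration of the technique, not the transformers library's actual implementation; the function name is my own.)

```python
def banned_next_tokens(generated, n):
    """Return the set of tokens that would complete an n-gram already
    present in `generated` -- the idea behind no_repeat_ngram_size."""
    if n <= 0 or len(generated) < n:
        return set()
    # The last n-1 tokens are the prefix the next token would extend.
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    # Slide over every n-gram seen so far; if its first n-1 tokens match
    # the current prefix, emitting its final token would repeat it.
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned
```

For example, with `generated = [1, 2, 3, 1, 2]` and `n = 3`, the trigram `(1, 2, 3)` has already occurred and the sequence currently ends in `(1, 2)`, so token `3` is banned as the next token.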
However, it suggests two improvements that could be made to all generative models:
- a list of "negative topic" n-grams could be provided to model.generate() to block those n-grams from ever being generated. For instance, if you want to generate an article on New York City but do not want it to touch on Covid-19, you could pass in a list of a few dozen n-grams on the topic of Covid-19.
- a list of "approved repeats" n-grams could also be provided. As stated in the blog, one trade-off of the no_repeat_ngram_size approach is that a recurring phrase such as "New York City" can only ever appear once in the output if no_repeat_ngram_size <= 3, despite the fact that an article on New York City would be expected to use it dozens of times.
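As a side note on the first idea, transformers' generate() already accepts a bad_words_ids argument that blocks given token sequences outright, which covers part of the "negative topic" use case. Below is a self-contained, pure-Python sketch of how both ideas might combine in one masking step; `negative_ngrams` and `approved_repeats` are hypothetical parameter names of my own, not an existing API.

```python
def banned_next_tokens(generated, n, negative_ngrams=(), approved_repeats=()):
    """Tokens to mask out before sampling the next token.

    - Repeating an n-gram already in `generated` is banned, unless that
      n-gram is listed in `approved_repeats` (hypothetical allow-list).
    - Completing any n-gram in `negative_ngrams` is always banned
      (hypothetical "negative topic" list, any length).
    """
    banned = set()
    approved = {tuple(g) for g in approved_repeats}
    if n > 0 and len(generated) >= n:
        suffix = tuple(generated[len(generated) - n + 1:])
        for i in range(len(generated) - n + 1):
            ngram = tuple(generated[i:i + n])
            if ngram[:-1] == suffix and ngram not in approved:
                banned.add(ngram[-1])
    # Ban the final token of a negative n-gram whenever the generated
    # text currently ends with that n-gram's prefix.
    for g in negative_ngrams:
        g = tuple(g)
        if len(g) == 1 or tuple(generated[-(len(g) - 1):]) == g[:-1]:
            banned.add(g[-1])
    return banned
```

In a real implementation this logic would live in a logits processor that sets the banned tokens' scores to negative infinity before sampling.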
Since models such as OPTForCausalLM share their generation methods through a common base class, implementing this once ought to cover all models. This makes this post a feature request.
Is there an appropriate GitHub repository to discuss such feature requests and/or bugs?
Can anybody recommend, in addition to this blog post, other introductory resources on text generation?
Thank you. David