Text generation adding random words, weird line breaks & symbols at random

Here’s the code I’m using to generate text.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

# kw holds the input prompt; text collects the generated outputs
sentence = tokenizer.encode(kw, return_tensors="pt")
output = model.generate(sentence, max_length=500, no_repeat_ngram_size=2, do_sample=False)
text.append(tokenizer.decode(output[0], skip_special_tokens=True))

The issue is that the output often comes like this:

"What are the benefits of using collagen?



, __________________, __________
The skin that has collagen has a higher level of hydrophilic (water-loving) proteins."

or like this:

Yes, collagen is a natural skin-repairing substance. It is also a powerful anti-inflammatory and antiaging agent. , and, are the most common types of collagen found in skin.

As you can see, it wrote ", and," at random, and this happens EXTREMELY often, in nearly every single text generation I did.

I don’t know if it’s related to my settings or not, but I’d appreciate any help you can give. I want the output to be as human-readable as possible, with 100-500 words per input.
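Until the root cause is found, a small post-processing pass can strip whole-line artifacts like the underscore runs and comma-only lines in the first sample. This is only a sketch: `clean_generated` is a hypothetical helper (not part of transformers), and the patterns are assumptions based on the two samples above.

```python
import re

def clean_generated(text: str) -> str:
    """Hypothetical cleanup for GPT-2 output artifacts like those shown above."""
    # Drop "fill in the blank" underscore runs such as __________
    text = re.sub(r"_{2,}", "", text)
    # Remove lines that now contain only whitespace/punctuation (e.g. ", , ")
    text = re.sub(r"^[\s,.;:_-]*$", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that this only removes whole-line artifacts; a mid-sentence insertion like ", and," would still need to be addressed at generation time.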

It might help if you gave more information about the model and tokenizer you’re using.

Sorry, I forgot; I’ve edited the post.

Could you provide an example input you’re using to run this (e.g. what is kw)? For context, I ran this exact code with the input “What are the benefits of using collagen?”, and my output was reasonable. Given that you’re using greedy decoding (do_sample=False), I’d expect the same deterministic behavior from the model, which is strange.
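As a side note: if the goal is natural-sounding 100-500 word outputs, sampling usually reads better than greedy decoding. The settings below are illustrative assumptions only, not values tested in this thread:

```python
# Hypothetical sampling settings for more natural long-form output;
# every value here is an illustrative guess, not a tested recommendation.
gen_kwargs = dict(
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.92,              # nucleus sampling: keep the top 92% probability mass
    temperature=0.8,         # <1.0 sharpens the distribution slightly
    no_repeat_ngram_size=2,  # kept from the original snippet
    min_length=100,          # rough lower bound, counted in tokens
    max_length=500,          # rough upper bound, counted in tokens
)
# output = model.generate(sentence, **gen_kwargs)
```

Keep in mind that min_length/max_length count tokens rather than words, so hitting an exact 100-500 word range would need a post-hoc word count.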

Never mind, it was a stupid mistake on my part. It’s fixed and not related to transformers at all.

Awesome! Happy to hear that it’s resolved.

Need more comprehensive support specialized to your use cases? Hugging Face has you covered! Through our Expert Acceleration Program, your business can leverage our expertise to accelerate your NLP roadmap, from modeling to production.