Text generation adding random words, weird line breaks & symbols at random

Here’s the code I’m using to generate text.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

# kw holds the input prompt; text collects the generated outputs
sentence = tokenizer.encode(kw, return_tensors="pt")
output = model.generate(sentence, max_length=500, no_repeat_ngram_size=2, do_sample=False)
text.append(tokenizer.decode(output[0], skip_special_tokens=True))

The issue is that the output often comes like this:

"What are the benefits of using collagen?



, __________________, __________
The skin that has collagen has a higher level of hydrophilic (water-loving) proteins."

or like this:

Yes, collagen is a natural skin-repairing substance. It is also a powerful anti-inflammatory and antiaging agent. , and, are the most common types of collagen found in skin.

As you can see, it wrote ", and," at random, and this happens EXTREMELY often, in nearly every single text generation I did.

I don’t know if it’s related to my settings or not, but I’d appreciate any help you can give. I want the output to be as human-readable as possible, with 100-500 words per input.
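Until the root cause is found, a small post-processing pass can strip whole-line artifacts like the underscore runs and comma-only lines in the first sample. This is only a sketch: `clean_generated` is a hypothetical helper (not part of transformers), and the patterns are assumptions based on the two samples above.

```python
import re

def clean_generated(text: str) -> str:
    """Hypothetical cleanup for GPT-2 output artifacts like those shown above."""
    # Drop "fill in the blank" underscore runs such as __________
    text = re.sub(r"_{2,}", "", text)
    # Remove lines that now contain only whitespace/punctuation (e.g. ", , ")
    text = re.sub(r"^[\s,.;:_-]*$", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that this only removes whole-line artifacts; a mid-sentence insertion like ", and," would still need to be addressed at generation time.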

It might help if you gave more information about the model and tokenizer you’re using.

Sorry, I forgot; I’ve edited the post.

Could you provide an example input you’re using to run this (e.g. what is kw)? For context, I ran this exact code with the input “What are the benefits of using collagen?”, and my output was reasonable. Given that you’re using greedy decoding (do_sample=False), I’d expect the same deterministic behavior from the model, which is strange.
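As a side note: if the goal is natural-sounding 100-500 word outputs, sampling usually reads better than greedy decoding. The settings below are illustrative assumptions only, not values tested in this thread:

```python
# Hypothetical sampling settings for more natural long-form output;
# every value here is an illustrative guess, not a tested recommendation.
gen_kwargs = dict(
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.92,              # nucleus sampling: keep the top 92% probability mass
    temperature=0.8,         # <1.0 sharpens the distribution slightly
    no_repeat_ngram_size=2,  # kept from the original snippet
    min_length=100,          # rough lower bound, counted in tokens
    max_length=500,          # rough upper bound, counted in tokens
)
# output = model.generate(sentence, **gen_kwargs)
```

Keep in mind that min_length/max_length count tokens rather than words, so hitting an exact 100-500 word range would need a post-hoc word count.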

Never mind, it was a stupid mistake on my part. It’s fixed and not related to transformers at all.

Awesome! Happy to hear that it’s resolved.

Need more comprehensive support specialized to your use cases? Hugging Face has you covered! Through our Expert Acceleration Program, your business can leverage our expertise to accelerate your NLP roadmap, from modeling to production.