Text generation adding random words, weird line breaks & symbols at random

Here’s the code I’m using to generate text.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

# kw holds the input prompt; text collects the generated outputs
sentence = tokenizer.encode(kw, return_tensors="pt")
output = model.generate(sentence, max_length=500, no_repeat_ngram_size=2, do_sample=False)
text.append(tokenizer.decode(output[0], skip_special_tokens=True))

The issue is that the output often comes like this:

"What are the benefits of using collagen?



, __________________, __________
The skin that has collagen has a higher level of hydrophilic (water-loving) proteins."

or like this:

Yes, collagen is a natural skin-repairing substance. It is also a powerful anti-inflammatory and antiaging agent. , and, are the most common types of collagen found in skin.

As you can see, it wrote ", and," at random, and this happens EXTREMELY often, in nearly every single text generation I did.

I don’t know if it’s related to my settings or not, but I’d appreciate any help you can give. I want the output to be as human-readable as possible, with 100-500 words per input.
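Until the root cause is found, a small post-processing pass can strip whole-line artifacts like the underscore runs and comma-only lines in the first sample. This is only a sketch: `clean_generated` is a hypothetical helper (not part of transformers), and the patterns are assumptions based on the two samples above.

```python
import re

def clean_generated(text: str) -> str:
    """Hypothetical cleanup for GPT-2 output artifacts like those shown above."""
    # Drop "fill in the blank" underscore runs such as __________
    text = re.sub(r"_{2,}", "", text)
    # Remove lines that now contain only whitespace/punctuation (e.g. ", , ")
    text = re.sub(r"^[\s,.;:_-]*$", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that this only removes whole-line artifacts; a mid-sentence insertion like ", and," would still need to be addressed at generation time.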

It might help if you gave more information about the model and tokenizer you’re using.

Sorry, I forgot; I’ve edited the post.

Could you provide an example input you’re using to run this (e.g. what is kw)? For context, I ran this exact code with the input “What are the benefits of using collagen?”, and my output was reasonable. Given that you’re using greedy decoding (do_sample=False), I’d expect the same deterministic behavior from the model, which is strange.
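As a side note: if the goal is natural-sounding 100-500 word outputs, sampling usually reads better than greedy decoding. The settings below are illustrative assumptions only, not values tested in this thread:

```python
# Hypothetical sampling settings for more natural long-form output;
# every value here is an illustrative guess, not a tested recommendation.
gen_kwargs = dict(
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.92,              # nucleus sampling: keep the top 92% probability mass
    temperature=0.8,         # <1.0 sharpens the distribution slightly
    no_repeat_ngram_size=2,  # kept from the original snippet
    min_length=100,          # rough lower bound, counted in tokens
    max_length=500,          # rough upper bound, counted in tokens
)
# output = model.generate(sentence, **gen_kwargs)
```

Keep in mind that min_length/max_length count tokens rather than words, so hitting an exact 100-500 word range would need a post-hoc word count.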

Never mind, it was a stupid mistake on my part. It’s fixed and not related to transformers at all.

Awesome! Happy to hear that it’s resolved.

Need more comprehensive support specialized to your use cases? Hugging Face has you covered! Through our Expert Acceleration Program, your business can leverage our expertise to accelerate your NLP roadmap, from modeling to production.