Is it possible to generate GPT-2 output without an input prompt?

Hi,

So, as the title says, I want to generate text without using any prompt text, just based on what the model learned from the training dataset. I tried giving a single space as the input prompt, but it did not work.

This is what I tried:

prompt_text = ' '

encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=50 + len(encoded_prompt[0]),
    temperature=0.7,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
    num_return_sequences=5,
)

and got the error:

RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

Thanks

You can wrap your samples in special tokens, e.g. <|startoftext|> and <|endoftext|>.
Then you can prompt the model by feeding it <|startoftext|> and stop the generation at <|endoftext|>.
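
A minimal sketch of what that could look like, assuming the stock gpt2 checkpoint from transformers (substitute your own fine-tuned model). Note that <|endoftext|> already exists in GPT-2's vocabulary, but <|startoftext|> does not, so it has to be registered as a special token and the embeddings resized before fine-tuning:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# <|startoftext|> is not in the stock vocabulary: register it and resize the embeddings
tokenizer.add_special_tokens({"bos_token": "<|startoftext|>"})
model.resize_token_embeddings(len(tokenizer))

# ... fine-tune on samples wrapped as "<|startoftext|> ... <|endoftext|>" ...

# At generation time, seed with <|startoftext|> and stop once <|endoftext|> is produced
input_ids = tokenizer.encode("<|startoftext|>", return_tensors="pt")
output_sequences = model.generate(
    input_ids=input_ids,
    max_length=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_sequences[0], skip_special_tokens=True))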

Thanks, so will this be the correct input:

prompt_text = '<|startoftext|> '

encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=50 + len(encoded_prompt[0]),
    temperature=0.7,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
    num_return_sequences=5,
)

Because doing this gives me a lot of other random tokens in some of the generated outputs such as:

<|startoftext|>cringe|<end of text|>cringe|<|last text|>cringe|<|end of text|>cringe|<|last text|>cringe|<|last text|>cringe|

and

<|startoftext|> 6 years old; i remember this well, remember its a good thing when you’re old enough to remember that there’s something bette

Why are the outputs of much poorer quality than when I provide an input prompt text?

I’ve never had this problem. Did you run the fine-tuning again with the samples now wrapped in those tokens?

No, I didn’t do that. Would I have to re-run the full training and change the dataset so that every sentence starts with this token?

You don’t have to re-run the entire training, just your fine-tuning.
Yes, you’d have to change the dataset you’re training with to include those tokens.
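
Something along these lines, as a rough sketch (the file names are placeholders; adapt it to however your dataset is stored):

# Wrap every sample in the start/end tokens before fine-tuning
# (train_raw.txt / train_wrapped.txt are placeholder names)
with open("train_raw.txt", encoding="utf-8") as f_in, \
        open("train_wrapped.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        text = line.strip()
        if text:
            f_out.write("<|startoftext|>" + text + "<|endoftext|>\n")

Then point your fine-tuning script at the wrapped file, register <|startoftext|> with the tokenizer as shown earlier, and prompt with <|startoftext|> at generation time.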