Hi,
So as the title says, I want to generate text without any input prompt, based only on what the model learned from the training dataset. I tried giving a single space as the prompt, but it did not work.
So I tried the following:
prompt_text = ' '
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=50 + len(encoded_prompt[0]),
    temperature=0.7,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
    num_return_sequences=5,
)
and got the error:
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
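The error happens because some tokenizers strip bare whitespace, so encoding `' '` with `add_special_tokens=False` can yield an empty tensor, leaving `generate()` nothing to reshape. A minimal sketch of the failure mode and a fallback, using a stand-in `FakeTokenizer` (the real one would be your GPT-2-style tokenizer):

```python
class FakeTokenizer:
    """Stand-in for a BPE tokenizer that strips bare whitespace."""
    bos_token_id = 50256  # GPT-2's <|endoftext|> id, used here for illustration

    def encode(self, text, add_special_tokens=False):
        text = text.strip()
        return [ord(c) for c in text]  # fake token ids, one per character

tokenizer = FakeTokenizer()

encoded = tokenizer.encode(' ')  # -> [] : nothing for generate() to work with
# Fall back to the BOS token so the model always gets at least one input id.
prompt_ids = encoded or [tokenizer.bos_token_id]
```

With the real `transformers` classes, the same idea is to pass the BOS token id as the prompt (or leave `input_ids` unset and let `generate()` start from `bos_token_id`) instead of an empty encoding.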
Thanks
You can wrap your samples in special tokens e.g. <|startoftext|> and <|endoftext|>.
Then you can prompt the model by feeding it <|startoftext|> and stop the generation at <|endoftext|>.
Thanks, so will this be the correct input:
prompt_text = '<|startoftext|> '
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=50 + len(encoded_prompt[0]),
    temperature=0.7,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
    num_return_sequences=5,
)
Because doing this gives me a lot of other random tokens in some of the generated outputs such as:
<|startoftext|>cringe|<end of text|>cringe|<|last text|>cringe|<|end of text|>cringe|<|last text|>cringe|<|last text|>cringe|
and
<|startoftext|> 6 years old; i remember this well, remember its a good thing when you’re old enough to remember that there’s something bette
The outputs are of much poorer quality than when I provide an input prompt. Why is that?
I’ve never had this problem. Did you run the fine-tuning again with the now-wrapped samples?
No, I didn’t do that. Would I have to re-run the full training, and change the dataset so that every sentence starts with this token?
You don’t have to re-run the entire training, just your fine-tuning.
Yes, you’d have to change the dataset you’re training with to include those tokens.
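A minimal sketch of that dataset change, wrapping each sample with the special tokens before re-running fine-tuning (`wrap_sample` is an illustrative helper, not part of any library):

```python
def wrap_sample(text, start='<|startoftext|>', end='<|endoftext|>'):
    """Wrap one training sample in the start/end markers."""
    return f'{start}{text}{end}'

dataset = ['first sentence.', 'second sentence.']
wrapped = [wrap_sample(s) for s in dataset]
# wrapped[0] -> '<|startoftext|>first sentence.<|endoftext|>'
```

With `transformers`, you would also register the new token with the tokenizer (e.g. via `tokenizer.add_special_tokens(...)`) and call `model.resize_token_embeddings(len(tokenizer))` before fine-tuning, so the model has an embedding for it.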