The T5 tokenizer ignores multiple newlines and encodes them as a single whitespace marker ▁ before the next token:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer("Hello how are you\n\nI am fine").tokens()
['▁Hello', '▁how', '▁are', '▁you', '▁I', '▁am', '▁fine', '</s>']
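To see what is happening without loading the model, here is a simplified, pure-Python sketch of the SentencePiece-style whitespace pre-tokenization (the real tokenizer does more, but the whitespace handling behaves like this): any run of whitespace, including consecutive newlines, collapses into a single ▁ marker.

```python
import re

def normalize_whitespace(text):
    # Simplified SentencePiece-style pre-tokenization: strip the ends,
    # then collapse every run of whitespace (spaces, tabs, newlines)
    # into a single "▁" word-boundary marker.
    return "▁" + re.sub(r"\s+", "▁", text.strip())

print(normalize_whitespace("Hello how are you\n\nI am fine"))
# ▁Hello▁how▁are▁you▁I▁am▁fine
```

So "\n", "\n\n", and "\n\n\n" all normalize to the same thing, which matches the token output above.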
If that's the case, then why are there multiple newlines in the Flan-T5 templates? For example:
FewShotPattern(
    inputs="Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
    targets="{summary}",
    inputs_prefix="",
    targets_prefix="summary: ",
    x_y_delimiter="\n\n",
    example_separator="\n\n"),
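For context, here is a hypothetical sketch of how those template fields compose a few-shot prompt. The `render` function and its arguments are my own illustration mirroring the field names, not code from the actual FLAN repository:

```python
def render(examples, pattern_inputs,
           targets_prefix="summary: ",
           x_y_delimiter="\n\n",
           example_separator="\n\n"):
    # Illustrative only: join each (dialogue, summary) shot using the
    # delimiters from the FewShotPattern above.
    shots = []
    for dialogue, summary in examples:
        x = pattern_inputs.format(dialogue=dialogue)
        y = targets_prefix + summary
        shots.append(x + x_y_delimiter + y)
    return example_separator.join(shots)

prompt = render(
    [("A: hi\nB: hello", "They greet each other.")],
    pattern_inputs="Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
)
print(prompt)
```

The `\n\n` delimiters end up in the prompt string, yet by the tokenizer behavior above they collapse to the same ▁ a single newline would produce.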
...
Even the DeepLearning.AI course suggests adding '\n\n\n' to separate the shot examples:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']
        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5.
        # Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""
But if multiple newlines do not change the encoded input, why are they in the templates at all?