The T5 tokenizer ignores multiple newlines and encodes them as a single whitespace marker ▁ before the next token:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer("Hello how are you\n\nI am fine").tokens()
['▁Hello', '▁how', '▁are', '▁you', '▁I', '▁am', '▁fine', '</s>']
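To see what is happening without loading the model, here is a simplified, pure-Python sketch of the SentencePiece-style whitespace pre-tokenization (the real tokenizer does more, but the whitespace handling behaves like this): any run of whitespace, including consecutive newlines, collapses into a single ▁ marker.

```python
import re

def normalize_whitespace(text):
    # Simplified SentencePiece-style pre-tokenization: strip the ends,
    # then collapse every run of whitespace (spaces, tabs, newlines)
    # into a single "▁" word-boundary marker.
    return "▁" + re.sub(r"\s+", "▁", text.strip())

print(normalize_whitespace("Hello how are you\n\nI am fine"))
# ▁Hello▁how▁are▁you▁I▁am▁fine
```

So "\n", "\n\n", and "\n\n\n" all normalize to the same thing, which matches the token output above.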
If that's the case, then why are there multiple newlines in the Flan-T5 templates? For example:
FewShotPattern(
    inputs="Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
    targets="{summary}",
    inputs_prefix="",
    targets_prefix="summary: ",
    x_y_delimiter="\n\n",
    example_separator="\n\n"),
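For context, here is a hypothetical sketch of how those template fields compose a few-shot prompt. The `render` function and its arguments are my own illustration mirroring the field names, not code from the actual FLAN repository:

```python
def render(examples, pattern_inputs,
           targets_prefix="summary: ",
           x_y_delimiter="\n\n",
           example_separator="\n\n"):
    # Illustrative only: join each (dialogue, summary) shot using the
    # delimiters from the FewShotPattern above.
    shots = []
    for dialogue, summary in examples:
        x = pattern_inputs.format(dialogue=dialogue)
        y = targets_prefix + summary
        shots.append(x + x_y_delimiter + y)
    return example_separator.join(shots)

prompt = render(
    [("A: hi\nB: hello", "They greet each other.")],
    pattern_inputs="Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
)
print(prompt)
```

The `\n\n` delimiters end up in the prompt string, yet by the tokenizer behavior above they collapse to the same ▁ a single newline would produce.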
...
Even the DeepLearning.AI course suggests adding '\n\n\n' to separate the shot examples:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']
        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5.
        # Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""
But if multiple newlines do not change the encoded input, why are they in the templates at all?