How does GPT decide to stop generating sentences without an EOS token?

  1. Exactly, the model learns during training not to attend to tokens before <|endoftext|> when predicting the tokens after it.
  2. You always aim to add <|endoftext|> only between two different samples that do not share a relationship. If your samples are books, you add <|endoftext|> after the end of the first book and before the beginning of the next one. That way, the model won’t attend to the previous book when generating the next one (see the sketch after this list).
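
As a minimal sketch of that idea, here is how unrelated samples could be joined with <|endoftext|> when building a training corpus. It assumes the Hugging Face transformers GPT-2 tokenizer; the sample strings are made up for illustration:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Two unrelated samples (e.g. two different books).
samples = [
    "Full text of the first book ...",
    "Full text of the second book ...",
]

# Separate the samples with <|endoftext|> so the model can learn that
# this token marks a boundary between unrelated documents.
corpus = tokenizer.eos_token.join(samples) + tokenizer.eos_token

token_ids = tokenizer.encode(corpus)
print(token_ids[:10])
```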

Remember that <|endoftext|> serves two purposes: attending to the right context and knowing when to generate it (so we can stop the generation). But understand the principle: if your dataset consists only of chapters of one book, you can add <|endoftext|> between chapters, no problem, since you have only one context (your book). In that case the model won’t learn that <|endoftext|> means “don’t attend to the previous tokens”, because attending to the previous tokens actually helps the model: the next chapter depends on the context of the previous one. So the model will just learn that when a chapter ends, it should output the <|endoftext|> token.
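
To make the second purpose concrete, here is a minimal sketch of a greedy decoding loop that stops as soon as the model emits <|endoftext|>. It assumes the Hugging Face transformers GPT-2 checkpoint; the prompt and the 200-token limit are arbitrary choices:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

eos_id = tokenizer.eos_token_id  # id of <|endoftext|>

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")

# Greedy decoding: append the most likely next token until the model
# produces <|endoftext|> (or we hit an arbitrary length limit).
for _ in range(200):
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax().item()
    if next_id == eos_id:
        break  # the model has "decided" to stop
    input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)

print(tokenizer.decode(input_ids[0]))
```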

In summary, the <|endoftext|> token is just a marker to help your model; you could add more marker tokens if you want, like end of sentence, end of paragraph, etc. When performing instruction fine-tuning we add special tokens like ### Response (see the sketch below). In the past there was research into adding entity tags to sentences, like:
[location] France is a good place to [action] work.
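
As a sketch of the instruction-tuning case, here is how extra marker tokens could be registered with a tokenizer. The marker strings below are illustrative, not a standard; it assumes the Hugging Face transformers API:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical marker tokens for instruction fine-tuning.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["### Instruction:", "### Response:"]}
)

# The embedding matrix must grow to cover the new token ids.
model.resize_token_embeddings(len(tokenizer))

example = "### Instruction: Translate to French. ### Response: Bonjour"
print(tokenizer.tokenize(example))
```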

This is a good paper about end-of-sentence and end-of-paragraph tokens:
