How does GPT decide to stop generating sentences without an EOS token?

  1. Exactly, the model learns during training not to attend to tokens before <|endoftext|> when predicting the tokens after it.
  2. You always aim to add <|endoftext|> only between two different samples that do not share a relationship. If your samples are books, you add <|endoftext|> after the end of the first book and before the beginning of the next one. That way, the model won’t attend to the previous book when generating the next one (see the sketch after this list).
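
As a minimal sketch of that idea, here is how unrelated samples could be joined with <|endoftext|> when building a training corpus. It assumes the Hugging Face transformers GPT-2 tokenizer; the sample strings are made up for illustration:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Two unrelated samples (e.g. two different books).
samples = [
    "Full text of the first book ...",
    "Full text of the second book ...",
]

# Separate the samples with <|endoftext|> so the model can learn that
# this token marks a boundary between unrelated documents.
corpus = tokenizer.eos_token.join(samples) + tokenizer.eos_token

token_ids = tokenizer.encode(corpus)
print(token_ids[:10])
```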

Remember that <|endoftext|> serves two purposes: attending to the right context and knowing when to generate it (so we can stop the generation). But understand the principle: if your dataset consists only of chapters of one book, you can add <|endoftext|> between chapters, no problem, since you have only one context (your book). In that case the model won’t learn that <|endoftext|> means “don’t attend to the previous tokens”, because attending to the previous tokens actually helps the model: the next chapter depends on the context of the previous one. So the model will just learn that when a chapter ends, it should output the <|endoftext|> token.
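
To make the second purpose concrete, here is a minimal sketch of a greedy decoding loop that stops as soon as the model emits <|endoftext|>. It assumes the Hugging Face transformers GPT-2 checkpoint; the prompt and the 200-token limit are arbitrary choices:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

eos_id = tokenizer.eos_token_id  # id of <|endoftext|>

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")

# Greedy decoding: append the most likely next token until the model
# produces <|endoftext|> (or we hit an arbitrary length limit).
for _ in range(200):
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax().item()
    if next_id == eos_id:
        break  # the model has "decided" to stop
    input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)

print(tokenizer.decode(input_ids[0]))
```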

In summary, the <|endoftext|> token is just a marker to help your model; you could add more marker tokens if you want, like end of sentence, end of paragraph, etc. When performing instruction fine-tuning we add special tokens like ### Response (see the sketch below). In the past there was research into adding entity tags to sentences, like:
[location] France is a good place to [action] work.
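
As a sketch of the instruction-tuning case, here is how extra marker tokens could be registered with a tokenizer. The marker strings below are illustrative, not a standard; it assumes the Hugging Face transformers API:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical marker tokens for instruction fine-tuning.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["### Instruction:", "### Response:"]}
)

# The embedding matrix must grow to cover the new token ids.
model.resize_token_embeddings(len(tokenizer))

example = "### Instruction: Translate to French. ### Response: Bonjour"
print(tokenizer.tokenize(example))
```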

This is a good paper about end-of-sentence and end-of-paragraph tokens:
