How does GPT decide to stop generating sentences without EOS token?

  1. It does affect the model. We need the EOS token for two things:
  • So the model learns when a sentence ends, for example which words usually end sentences.
  • To separate different sentences, so the model learns not to attend to the previous sentence.

Let’s say we have the following sample:
“I love my dog because he is funny. [EOS] The last time I was drinking, I was happy. [EOS]”
The model will use the [EOS] token to learn when to generate the [EOS] token (so the sentence ends).
And for the attention, we will have, for example, the input:
"I love my dog because he is funny. [EOS] The last time I was " (correct = drinking)
The model will use the [EOS] token to learn to not attend to the first part of the input when predicting the next word.
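To make that training signal concrete, here is a minimal sketch of how such a sample turns into (input, target) pairs for next-token prediction. It assumes the Hugging Face GPT-2 tokenizer, whose <|endoftext|> token plays the role of [EOS]; any tokenizer with an EOS token works the same way:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = ("I love my dog because he is funny. <|endoftext|> "
        "The last time I was drinking, I was happy. <|endoftext|>")
ids = tokenizer(text)["input_ids"]

# Standard next-token language modeling: the target is the input shifted by one,
# so the model is trained to emit <|endoftext|> right after "funny."
inputs, targets = ids[:-1], ids[1:]
for i, t in zip(inputs, targets):
    print(f"{tokenizer.decode([i])!r:>12} -> {tokenizer.decode([t])!r}")
```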

  2. No. As mentioned above, the model learns not to attend to the previous sentence during training.

  3. The model will generate text infinitely. So let’s say you trained with the text of our previous example: if you start generating by passing “I”, the model will output “love”; then you pass “I love” and it outputs “my”, and so on. When you pass “I love my dog because he is funny.” the model will output the [EOS] token, since that is what it learned during training. And if you now pass “I love my dog because he is funny. [EOS]”, the model will output “The”.
    The model doesn’t know how to stop and never stops on its own. What happens is that, during inference, we add an if statement that checks whether the output was the [EOS] token and, if so, we break the generation loop.
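In code, that stopping condition really is just an if statement. A rough sketch, assuming a Hugging Face-style causal LM whose forward pass returns .logits (the function name and arguments here are illustrative, not a library API); the max_new_tokens cap is the usual safety limit in case [EOS] never shows up:

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=100):
    # Hypothetical greedy generation loop, not a library function.
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits            # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()          # greedy choice of the next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        # The model never decides to stop by itself: we stop for it
        # as soon as it emits the [EOS] token.
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0])
```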

  4. Good question. That is something that is rarely taught. There are essentially two ways of creating the samples for training:

  • We concatenate all text that we have with the [EOS] token in between them (and tokenize).
  • We don’t concatenate, we simply have a dataset of samples.

For the first case, to generate the inputs for batches, we just take chunks of the tokenized dataset according to the block size parameter (context length). If it is 1024, we take the first 1024 tokens for the first sample, then the next 1024 for the second, and so on. And yes, in this case we will end up with samples that contain two or more sentences separated by the [EOS] token, just like our example above.
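A rough sketch of this concatenate-and-chunk strategy (make_chunked_samples is a hypothetical helper, not a library function):

```python
def make_chunked_samples(texts, tokenizer, block_size=1024):
    # Concatenate all documents into one long token stream,
    # with [EOS] in between, then cut it into fixed-size blocks.
    ids = []
    for text in texts:
        ids.extend(tokenizer(text)["input_ids"])
        ids.append(tokenizer.eos_token_id)        # [EOS] separates documents
    # Drop the ragged tail so every sample is exactly block_size tokens long.
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```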

In the second case, we use each sample as a different input to the model. If we have a batch size of 1, we can simply pass each tokenized sample to the model and train on it. But if we have a batch size of 4, all inputs inside a batch need to have the same length; that is related to how libraries like PyTorch take advantage of batch processing on GPUs. For that, we add the [PAD] token at the end of the sentences that are shorter than the longest one in the batch.
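And a sketch of the padding for the second case (pad_batch is again a hypothetical helper); the attention mask is the standard way to tell the model which positions are [PAD] and should be ignored:

```python
import torch

def pad_batch(samples, pad_token_id):
    # Right-pad every tokenized sample to the length of the longest one
    # in the batch so they can be stacked into a single tensor.
    max_len = max(len(s) for s in samples)
    input_ids = torch.full((len(samples), max_len), pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros(len(samples), max_len, dtype=torch.long)
    for i, s in enumerate(samples):
        input_ids[i, :len(s)] = torch.tensor(s)
        attention_mask[i, :len(s)] = 1            # 1 = real token, 0 = [PAD]
    return input_ids, attention_mask
```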

Both techniques work and have advantages and disadvantages. The first case is easier and faster to train with, but it adds one more thing for the model to learn (not attending to what comes before the [EOS] token). The second case is more natural and the model learns each sample’s context better, since the samples are kept separate, but training is harder and slower (it wastes computation predicting the [PAD] tokens).

GPT models use the first case, which is why they don’t have a [PAD] token.
You can actually check this by prompting ChatGPT with “Explain about <|endoftext>”. (Note that I wrote the [EOS] token without the | character before >, on purpose: if you pass the actual <|endoftext|>, ChatGPT receives it as blank and can’t understand the question.)
You will see that it starts to answer with something like “The <|endoftext|> …” and then simply continues with unrelated text. That is because it learned not to attend to tokens that come before the [EOS] token.
