How does GPT decide to stop generating sentences without EOS token?

Hi.
In the “Training a causal language model from scratch” part of the NLP course, the text suggests concatenating sequences with an EOS token to train a CLM more efficiently:

As you increase the context size (or if you have a corpus of short documents), the fraction of chunks that are thrown away will also grow. A more efficient way to prepare the data is to join all the tokenized samples in a batch with an eos_token_id token in between, and then perform the chunking on the concatenated sequences. As an exercise, modify the tokenize() function to make use of that approach. Note that you’ll want to set truncation=False and remove the other arguments from the tokenizer to get the full sequence of token IDs.

But if I concatenate multiple sentences with multiple EOS tokens in one training sequence, how can the model learn to stop generating a sequence? The sequence continues after the EOS token, so the model will never know that it needs to stop after generating the EOS token.

Also, when I print the special tokens of OpenAI’s GPT-2, there is no padding token.

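from transformers import pipeline

# Load the GPT-2 text-generation pipeline and print its special tokens.
generation_gpt2 = pipeline("text-generation", model="gpt2")
print(generation_gpt2.tokenizer.special_tokens_map)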

"""
Result:
{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}
"""

Why is there no padding token, and why are the bos, eos, and unk tokens the same?

In the context of training a causal language model (CLM) using sequences with EOS tokens, it’s important to note that the presence of the EOS token primarily serves as a signal during training. It helps the model learn the concept of sentence boundaries but does not explicitly instruct the model to stop generating text.

When multiple sentences are concatenated with multiple EOS tokens in a single training sequence, the model indeed continues generating text after each EOS token. However, the training process still allows the model to learn the patterns and dependencies within and across sentences. Over time, the model can learn to generate coherent and meaningful text, even if there are multiple EOS tokens within the training sequence. The process of learning to stop generating text at appropriate points is a result of the model’s training on a large dataset that contains diverse examples of sentence structures.

Regarding the special tokens in OpenAI’s GPT-2 model, it is important to understand that different language models may have different configurations for their special tokens. In the case of GPT-2, the absence of a padding token and the same tokens for BOS (beginning of sequence), EOS (end of sequence), and UNK (unknown) indicate the model’s specific design choices.

While padding tokens can be useful in certain applications, GPT-2 does not use padding tokens as it does not require fixed-length input sequences. Instead, the model processes input dynamically based on the actual length of the input sequence.

Regarding the BOS, EOS, and UNK tokens being the same, it suggests that GPT-2 does not differentiate between these special tokens and treats them as equivalent. This means that the model does not assign any special meaning to these tokens during generation.

It’s important to note that the absence of a padding token and the design choice of treating BOS, EOS, and UNK tokens as the same do not impact the model’s ability to generate text effectively or learn meaningful representations from the training data. The model’s training process and the patterns within the data enable it to generate coherent and contextually relevant text.

I really appreciate the nice explanation.
But a few questions come to my mind:

  1. Then why do we need the EOS token? What’s the role of the EOS token in GPT-2 if it doesn’t affect the model?

  2. If I understand below correctly, each training sequence for CLM should be from the same source because the model needs to learn to generate sentences coherently. Am I right?

  3. What is the logic of GPT’s sentence generation? How can the model know when to stop?

  4. According to the below quote, even if the model processes input dynamically based on the actual length of the input sequence, the lengths of the input sequences in a batch can differ. I thought that was the reason we need a pad token.

Thank you!!

  1. It does affect the model. We need the EOS token for two things:
  • To let the model learn when a sentence ends, so it learns, for example, which words usually end sentences.
  • To separate different sentences, so the model learns to not attend to the previous sentence.

Let’s say we have the following sample:
“I love my dog because he is funny. [EOS] The last time I was drinking, I was happy. [EOS]”
The model will use the [EOS] token to learn when to generate the [EOS] token (so the sentence ends).
And for the attention, for example, we will have the input:
"I love my dog because he is funny. [EOS] The last time I was " (correct = drinking)
The model will use the [EOS] token to learn to not attend to the first part of the input when predicting the next word.

  2. No. As mentioned above, the model learns to not attend to the previous sentence during training.

  3. The model will generate text indefinitely. Let’s say you trained on the text of our previous example. If you start generating by passing “I”, the model will output “love”; then you pass “I love”, and it outputs “my”, and so on. When you pass “I love my dog because he is funny.” the model will output the EOS token, since that is what it learned during training. And if you now pass “I love my dog because he is funny. [EOS]” the model will output “The”.
    The model doesn’t know how to stop and never stops. What happens is that, during inference, we add an if statement to check whether the output was the [EOS] token, and if so we break the generation loop.
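The stop check can be sketched in a few lines of plain Python. This is only a toy: the lookup table `NEXT_TOKEN` and the function `toy_next_token` are made up here to stand in for a trained model (a real model returns scores over the whole vocabulary), but the loop has the same shape.

```python
# Toy stand-in for a trained model: a lookup table from the last token
# to the next one (hypothetical; a real model returns logits over a vocabulary).
EOS = "[EOS]"
NEXT_TOKEN = {
    "I": "love", "love": "my", "my": "dog", "dog": "because",
    "because": "he", "he": "is", "is": "funny.", "funny.": EOS,
    EOS: "The",  # the "model" itself would happily keep going after [EOS]
}


def toy_next_token(tokens):
    """Return the next token given the tokens generated so far."""
    return NEXT_TOKEN[tokens[-1]]


def generate(prompt, max_new_tokens=20):
    """Generate token by token and stop as soon as [EOS] is produced."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == EOS:
            # The stopping rule lives here, in the loop, not in the model.
            break
        tokens.append(next_token)
    return tokens


print(" ".join(generate(["I"])))  # I love my dog because he is funny.
```

Without the `if`, the loop would simply continue with “The” and whatever follows, until `max_new_tokens` runs out.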

  4. Good question. That is something that is rarely taught. There are essentially two ways of creating the samples for training:

  • We concatenate all text that we have with the [EOS] token in between them (and tokenize).
  • We don’t concatenate, we simply have a dataset of samples.

For the first case, to generate the inputs for batches, we just get chunks of the tokenized dataset according to the block size parameter (context length). If it is 1024, we will get the first 1024 tokens for the first sample, then the next 1024 for the second, and so on. And yes, in this case we will end up with samples that have two or more sentences separated with the [EOS] token, just like our example above.
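A minimal sketch of this first approach, in plain Python with made-up token IDs. The function name `concat_and_chunk` and the value of `EOS_ID` are assumptions for illustration (GPT-2’s real <|endoftext|> ID is 50256), and dropping the incomplete last chunk is just one possible choice.

```python
EOS_ID = 0  # hypothetical ID for this toy example; GPT-2's <|endoftext|> is 50256


def concat_and_chunk(tokenized_docs, block_size, eos_id=EOS_ID):
    """Join documents with an EOS ID after each one, then cut fixed-size blocks."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)
    # Drop the incomplete tail so every block has exactly block_size tokens.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


docs = [[11, 12, 13], [21, 22, 23, 24, 25], [31, 32]]
blocks = concat_and_chunk(docs, block_size=4)
print(blocks)  # [[11, 12, 13, 0], [21, 22, 23, 24], [25, 0, 31, 32]]
```

The last block contains the end of one document, the EOS ID, and the start of the next document, which is exactly the mixed kind of sample described above.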

In the second case, we use each sample as a separate input to the model. If we have a batch size of 1, we could simply pass each tokenized sample to the model and train on it. But if we have a batch size of 4, all inputs inside a batch need to have the same length, because libraries like PyTorch operate on rectangular tensors to take advantage of batch processing on GPUs. For that, we add the [PAD] token at the end of the samples that are shorter than the longest one in the batch.
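Here is a rough sketch of that padding step in plain Python. `PAD_ID` and `pad_batch` are names invented for this example, and the token IDs are arbitrary. The mask marks which positions are real tokens; in practice such a mask is typically used to keep the pad positions out of the loss.

```python
PAD_ID = -1  # hypothetical pad ID, chosen only for this illustration


def pad_batch(samples, pad_id=PAD_ID):
    """Right-pad every sample to the longest length in the batch.

    Returns the padded IDs and a mask (1 = real token, 0 = padding).
    """
    max_len = max(len(s) for s in samples)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in samples]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in samples]
    return input_ids, mask


batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13], [14, 15]]
input_ids, mask = pad_batch(batch)

# Share of positions in the batch that hold padding instead of real tokens.
pad_fraction = 1 - sum(map(sum, mask)) / (len(mask) * len(mask[0]))
print(input_ids)
print(f"{pad_fraction:.0%} of the positions are padding")
```

In this toy batch 9 of the 20 positions are padding, which gives a feeling for how much computation can be spent on tokens that carry no information.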

Both techniques work, and both have advantages and disadvantages. The first is easier and faster to train, but it adds another thing for the model to learn (to not attend to the previous sentence when there is an [EOS] token). The second is more natural, and the model learns the context better, since each sample is kept separate. But the training process is harder and slower (it wastes computation on the [PAD] tokens).

GPT models use the first case, that is why they don’t have [PAD] tokens.
You can actually check it by prompting ChatGPT with “Explain about <|endoftext>”. (Note that I passed the [EOS] token missing the character | before >, that is on purpose, since if you pass the actual <|endoftext|>, ChatGPT receives it as blank and can’t understand the question).
You will see that it starts to answer like “The <|endoftext|> …” and after that it simply answers with an uncorrelated text. That is because it learned to not attend to tokens that are before the [EOS] token.

Thanks for the nice answer.

  1. According to Training a causal language model from scratch - Hugging Face NLP Course, the tokenizer doesn’t return eos token. Why is that?
  2. How can the model not learn to attend to the previous sentence during training? I believe the gpt2 tokenizer’s attention mask is [1, …, 1] which means there’s no difference between sentences (from the same source above).
  3. Ok. Then are there any techniques to generate much ‘nicer’ sentences? Let’s say I stop at the first eos token. Then the model generates only one single sentence. But only one sentence is not enough so someone may want to generate multiple sentences for a much nicer answer.

I really appreciate your kind answer. It helps a lot.

  1. I don’t understand the question. If the tokenizer was trained with an EOS token, it has the tokenizer.eos_token. Can you point to where on the course page it says that the tokenizer doesn’t have it?

  2. and 3.

Each input the model receives during training is in the following format:

“that a day on Venus lasts longer than a year on Venus? Due to its extremely slow rotation on its axis, a single day (sunrise to sunrise) on Venus lasts 243 Earth days. However, its orbit around the sun only takes about 225 Earth days, making a Venusian day longer than its year.<|endoftext|>Once upon a time, in a small, coastal town in Italy known as Belmare, lived an artist named Marco. This wasn’t your typical artist - Marco was a sculptor with a unique trait; he was completely blind. His condition had been with him”

As you can see, most of the time the sample won’t be a clean text. It will start partway through a text from the dataset; if that text ends inside the sample, it will be followed by the EOS token and then by another, unrelated text (which will probably be cut off too). This is exactly what the model will receive (tokenized), and it will try to generate a prediction for each position.
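To make “a prediction for each position” concrete, here is a minimal sketch in plain Python. It uses words instead of real token IDs, and `next_token_pairs` is a name made up for this illustration, not a library function. Each pair shows the last token the model has seen (it also sees everything before it) and the token it is trained to predict next.

```python
def next_token_pairs(tokens):
    """Pair every position with the token the model must predict next."""
    return list(zip(tokens[:-1], tokens[1:]))


chunk = ["than", "its", "year.", "<|endoftext|>", "Once", "upon", "a", "time"]
pairs = next_token_pairs(chunk)
for current, target in pairs:
    print(f"{current!r:>16} -> {target!r}")
```

Two of these pairs involve the EOS token: the one with “year.” as input is where the model learns when a text ends, and the one with <|endoftext|> as input is where it has to start an unrelated text.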

The model learns to not attend to the previous text by using the EOS token. When it receives “that a day on […] <|endoftext|>Once” it will try to generate the token “upon” but it will learn during training that if it received the EOS token before, it shouldn’t use the information before it to predict. It is part of the learning process, and is not manually told to the model.

About your question 3: each text inside the sample is not the same thing as a sentence, if by a sentence you mean something between periods, like “My dog is awesome.”. We only put the EOS token between entire texts, not between sentences. If I have a story about a dog and it has many sentences, I will only put the EOS token before and after the entire text, to separate it from the next random text in our dataset.

With that, the model will learn how long each response needs to be (when to generate the EOS token). So if you have a lot of long stories about dogs in your dataset, and you start generating a dog story, the model will generate the entire story, with many sentences in it, and only generate the EOS token when it thinks it makes sense to end the story.

After the EOS token is generated, all new generation will be random, since it learned to not attend to any tokens before the EOS when generating next ones. As I mentioned, you can test it by asking ChatGPT: “Explain about <|endoftext>”. It will give you a random answer after it outputs the EOS token.

  1. Nooo. You are right. The tokens’ length is too long so I missed the eos token.
  2. Ok. But how is it possible for the model not to attend to the previous sentences? We use the self-attention mechanism, which means that the tokens after <|endoftext|> will attend to previous tokens. Do you mean that the model detects the lack of coherence between the sentences, so that it doesn’t attend to sentences after/before <|endoftext|> because they are unnatural? Can you give me some reference about it?
  3. Then if I train GPT-2 with book corpus, the eos token will be added after the end of an entire book or just a chapter? Which criteria are widely used (standard) for training GPT?
  2. Exactly, the model learns during training to not attend to tokens before <|endoftext|> when predicting the tokens after it.
  3. You always aim to add <|endoftext|> only between two different samples that are unrelated to each other. If your samples are books, you add <|endoftext|> between the end of one book and the beginning of the next one. With that, the model won’t attend to the previous book when generating the next one.

Remember that <|endoftext|> serves two purposes: attending to the right context and knowing when to generate it (so we can stop the generation). But understand the principle: if your dataset only consists of chapters of one book, you can add <|endoftext|> between chapters, no problem, since you have only one context (your book). In this case the model won’t learn that <|endoftext|> means “don’t attend to the previous tokens”, because attending to the previous tokens actually helps the model, since the next chapter depends on the context of the previous one. So the model will just learn that when a chapter ends, it should output the <|endoftext|> token.

In summary, the <|endoftext|> token is just a marker to help your model. You could add more marker tokens if you want, like end of sentence, end of paragraph, etc. When fine-tuning for instructions, we add special markers like ### Response. In the past there was research on adding entity tags to the sentences, like:
[location] France is a good place to [action] work.

This is a good paper about end of sentence and end of paragraph tokens:

It helps a looooooooooooooot! Thank you very much.
I have one last question. In the serving (inference) environment, we take inputs as batches because of the efficiency of the GPUs. But the input lengths in the requests vary, so I think the system needs PAD tokens. Do we just add the PAD token then? How can we deal with requests of varying input lengths?

Sure.
Yes, we pad the sentences that are shorter. Many GPT-style models don’t have a dedicated PAD token; a common choice is to reuse the <|endoftext|> token for padding. And if you pad on the right and only need the next-token prediction from a single forward pass, you also don’t need to pass an attention mask. Here is why:
Let’s say we have: “I love my mom because <|endoftext|> <|endoftext|>”
We are interested in predicting the word after “because”, and since GPT-like models don’t attend to future tokens, that prediction won’t attend to the pad tokens anyway.
The only thing you need to know is the position of the last real token, just before the pad tokens, because that is where the prediction you want is.
If we consider the example above as each word is a token, the model would give back the result, for example:
[love, him, friend, so, she, The, For]
Remember that the outputs don’t have relationships themselves, since they are predictions for each position from your input.
The first output “love” only saw “I”
The second output “him” only saw “I love”
The third output “friend” only saw “I love my”
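If you want to convince yourself of this, here is a toy single-head causal self-attention in plain Python. It is only a sketch under strong assumptions: it uses identity projections and hand-picked 2-dimensional vectors, while a real GPT has learned projections, many heads and layers, and position embeddings. The causal property it demonstrates is the same, though: output i only depends on inputs 0..i, so whatever you put in the padded positions on the right cannot change the outputs for the real tokens.

```python
import math


def causal_self_attention(x):
    """Single-head causal self-attention with identity projections (toy).

    x is a list of vectors; output i is a weighted average of x[0..i] only.
    """
    out = []
    for i, query in enumerate(x):
        visible = x[: i + 1]  # the causal mask: nothing after position i
        scores = [sum(q * k for q, k in zip(query, key)) for key in visible]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, visible)) / total
            for d in range(len(query))
        ])
    return out


real = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # three "real" tokens
pad_a = [[9.0, 9.0], [9.0, 9.0]]             # one choice of padding vectors
pad_b = [[-3.0, 2.0], [4.0, -1.0]]           # a completely different choice

out_a = causal_self_attention(real + pad_a)
out_b = causal_self_attention(real + pad_b)
print(out_a[:3] == out_b[:3])  # True
```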

We are interested in the output “she” that is the 5th position and only saw “I love my mom because”. So if you have 4 sentences like:

“I love my mom because <|endoftext|> <|endoftext|>”
“The best way to <|endoftext|> <|endoftext|> <|endoftext|>”
“When I was a kid I liked”
“This <|endoftext|> <|endoftext|> <|endoftext|> <|endoftext|> <|endoftext|> <|endoftext|>”

Note that all sentences have 7 words when padded.
You are interested in the positions [4, 3, 6, 0] (starting from 0). You just need to create this list before you pass the inputs so you know which predictions you should be looking at.
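A small sketch of how that list can be built, assuming (as in the example above) that each word is one token. `pad_and_locate` is a hypothetical helper written for this post; with a real tokenizer you would count token IDs instead of splitting on whitespace.

```python
EOS = "<|endoftext|>"


def pad_and_locate(sentences, pad_token=EOS):
    """Right-pad whitespace-split sentences and record the last real position."""
    tokenized = [s.split() for s in sentences]
    max_len = max(len(t) for t in tokenized)
    # Index of the last real token in each row, computed before padding.
    last_positions = [len(t) - 1 for t in tokenized]
    padded = [t + [pad_token] * (max_len - len(t)) for t in tokenized]
    return padded, last_positions


sentences = [
    "I love my mom because",
    "The best way to",
    "When I was a kid I liked",
    "This",
]
padded, last_positions = pad_and_locate(sentences)
print(last_positions)  # [4, 3, 6, 0]
```

After the forward pass, you would read the prediction for row `r` at index `last_positions[r]` and ignore the outputs at the padded positions.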

Thank you for the nice answer. You are such a wonderful teacher.

Hello to everyone,
I think this is the best conversation I have found so far that clarifies the use of special tokens for fine-tuning GPT-2 on a custom context. To be sure that I have understood what has been mentioned, I would like someone to confirm or disprove the following. I have found a pretrained GPT-2 model trained on the Greek language on Hugging Face, named nikokons/gpt2-greek, and I want to fine-tune it on my custom dataset. My dataset consists of samples of mathematical definitions with related questions written in Greek. Let me give some translated examples:

Definition: Two angles are called supplementary angles when they sum up to 180 degrees.
Question: What are supplementary angles?

Definition: Two angles are called supplementary angles when they sum up to 180 degrees.
Question: What do we call two angles which sum up to 180 degrees?

Definition: A triangle is called isosceles when it has two sides of equal length.
Question: What is an isosceles triangle?

Definition: A triangle is called isosceles when it has two sides of equal length.
Question: What do we call a triangle which has two sides of equal length?

Notice that for one Definition I might have multiple questions in my dataset. I want to fine-tune the model so that it learns to answer the user’s question with the entire Definition related to that question.

What are the steps I should follow?
Should I first fine-tune the model on the raw dataset (I mean the dataset without special tokens) so that it learns the new terminology, and then preprocess the dataset to add the <|endoftext|> token at the beginning and at the end of each sample, and fine-tune the model again on the new preprocessed dataset?

Would the processed dataset look like the following?

<|endoftext|> A triangle is called isosceles when it has two sides of equal length. What is an isosceles triangle? <|endoftext|>
Two angles are called supplementary angles when they sum up to 180 degrees.
What do we call two angles which sum up to 180 degrees?<|endoftext|>