- I don’t understand the question. If the tokenizer was trained with an EOS token, it has the `tokenizer.eos_token` attribute. Can you point to where in the course page it says that the tokenizer doesn’t have it?
- Regarding your questions 2 and 3:
Each input the model receives during training is in the following format:
“that a day on Venus lasts longer than a year on Venus? Due to its extremely slow rotation on its axis, a single day (sunrise to sunrise) on Venus lasts 243 Earth days. However, its orbit around the sun only takes about 225 Earth days, making a Venusian day longer than its year.<|endoftext|>Once upon a time, in a small, coastal town in Italy known as Belmare, lived an artist named Marco. This wasn’t your typical artist - Marco was a sculptor with a unique trait; he was completely blind. His condition had been with him”
As you can see, most of the time a sample won’t be a clean, self-contained text: it will start partway through one text from the dataset and, if that text ends within the sample, it will contain the EOS token followed by another random text (which will probably be cut off as well). This is exactly what the model receives (tokenized), and it tries to predict the next token at every position.
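If it helps to see this concretely, here is a minimal sketch of that “packing” step. I’m using the GPT-2 tokenizer and a toy block size purely for illustration; the course’s actual pipeline may differ:

```python
from transformers import AutoTokenizer

# Illustrative choices, not necessarily what the course uses:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 64  # real pretraining uses something like 1024 or 2048

docs = [
    "Did you know that a day on Venus lasts longer than a year on Venus? ...",
    "Once upon a time, in a small, coastal town in Italy known as Belmare, ...",
]

# Concatenate every document, separating them with the EOS token id.
ids = []
for doc in docs:
    ids.extend(tokenizer(doc)["input_ids"])
    ids.append(tokenizer.eos_token_id)

# Cut the long stream into fixed-size training samples. A sample can start
# and end mid-document, exactly like the example above.
samples = [ids[i : i + block_size] for i in range(0, len(ids), block_size)]
print(tokenizer.decode(samples[0]))
```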
The model learns not to attend to the previous text by using the EOS token. When it receives “that a day on […] <|endoftext|>Once”, it will try to generate the token “upon”, and during training it learns that if an EOS token appeared earlier, it shouldn’t use the information before it for the prediction. This is part of the learning process; it is not manually told to the model.
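You can even poke at this with a pretrained model: feed it a packed string and look at the prediction right after the EOS token. A small sketch, again assuming GPT-2 (the exact continuation will vary by model):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# "gpt2" is just an illustrative model choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "making a Venusian day longer than its year.<|endoftext|>Once"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy prediction for the position right after "Once". A well-trained
# model will typically continue with " upon", ignoring the Venus text.
next_id = logits[0, -1].argmax().item()
print(repr(tokenizer.decode([next_id])))
```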
About your question 3: each text inside the sample is not the same as each sentence, if by “sentence” you mean something between periods, like “My dog is awesome.” We only put the EOS token between entire texts, not between sentences. If I have a story about a dog with many sentences, I will only put the EOS token before and after the entire text, so it is separated from the next random text in our dataset.
With that, the model will learn how long each response needs to be (i.e., when to generate the EOS token). So if you have a lot of long stories about dogs in your dataset and you start generating a dog story, the model will generate the entire story, with many sentences in it, and only generate the EOS token when it thinks it makes sense to end the story.
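That is also why, at inference time, generation loops typically stop as soon as the model emits the EOS token. A minimal sketch with `generate`, once more assuming GPT-2 just for illustration:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Once upon a time, there was a dog named Rex."
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,  # stop when the model ends the story
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
)
print(tokenizer.decode(output[0]))
```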
After the EOS token is generated, everything that follows will be random, since the model learned not to attend to any tokens before the EOS when generating the next ones. As I mentioned, you can test this by asking ChatGPT: “Explain about <|endoftext|>”. It will give you a random answer after it outputs the EOS token.