Hi.
In the Training a causal language model from scratch part of the NLP course, it says one can concatenate sequences with an EOS token in between to train a CLM more efficiently:
As you increase the context size (or if you have a corpus of short documents), the fraction of chunks that are thrown away will also grow. A more efficient way to prepare the data is to join all the tokenized samples in a batch with an eos_token_id token in between, and then perform the chunking on the concatenated sequences. As an exercise, modify the tokenize() function to make use of that approach. Note that you'll want to set truncation=False and remove the other arguments from the tokenizer to get the full sequence of token IDs.
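
If I understand the exercise correctly, the modified tokenize() would look roughly like this (just my sketch, assuming the plain GPT-2 tokenizer and the context_length / "content" column used in the course notebook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128

def tokenize(element):
    # Tokenize without truncation to keep the full sequence of token IDs
    outputs = tokenizer(element["content"], truncation=False)

    # Join all tokenized samples in the batch with an EOS token in between
    input_ids = []
    for ids in outputs["input_ids"]:
        input_ids.extend(ids + [tokenizer.eos_token_id])

    # Chunk the concatenated sequence into blocks of context_length,
    # dropping the final chunk if it is shorter
    chunks = [
        input_ids[i : i + context_length]
        for i in range(0, len(input_ids), context_length)
    ]
    if len(chunks[-1]) < context_length:
        chunks = chunks[:-1]

    return {"input_ids": chunks}

# Meant to be applied with a batched map, e.g.:
# tokenized = raw_datasets.map(tokenize, batched=True, remove_columns=raw_datasets["train"].column_names)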
But if I concatenate multiple sentences with multiple EOS tokens in one training sequence, how can the model learn to stop generating? The sequence continues after each EOS token, so it seems the model would never learn that it should stop after producing an EOS token.
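
To make the concern concrete, here is roughly what one concatenated chunk looks like (a toy sketch using GPT-2's tokenizer and two made-up documents):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
eos = tokenizer.eos_token  # '<|endoftext|>'

# Two short "documents" joined with EOS land in the same training chunk,
# so during training the target right after <|endoftext|> is the first
# token of the *next* document, not a stop signal.
chunk = f"First document.{eos}Second document.{eos}"
ids = tokenizer(chunk)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))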
Also, when I print the special tokens of OpenAI's GPT-2, there is no padding token:
from transformers import pipeline

generation_gpt2 = pipeline("text-generation", model="gpt2")
print(generation_gpt2.tokenizer.special_tokens_map)
"""
Result:
{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}
"""
Why is there no padding token, and why are the bos, eos, and unk tokens all the same?