I was going through this article from the NLP course: Training a causal language model from scratch - Hugging Face NLP Course
I see that there are two strategies here, depending on the sequence length vs. the context length:
- If the sequence length is much longer than the context length, we can chunk each sequence into samples of the context length and either discard or pad the leftover chunk at the end
- If the context length is comparable to or greater than the length of the input sequences, we concatenate samples with an EOS token in between and then chunk this one big concatenated sequence
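To make sure I understand the second strategy, here is a minimal sketch of what I think the concatenate-and-chunk step does (the token IDs, `eos_token_id`, and `context_length` below are made-up placeholders, not values from the course):

```python
def concat_and_chunk(tokenized_samples, eos_token_id, context_length):
    # Join all tokenized samples into one long token stream,
    # separating consecutive samples with an EOS token.
    stream = []
    for sample in tokenized_samples:
        stream.extend(sample)
        stream.append(eos_token_id)
    # Slice the stream into fixed-size chunks; drop the leftover tail.
    total = (len(stream) // context_length) * context_length
    return [stream[i : i + context_length] for i in range(0, total, context_length)]

chunks = concat_and_chunk(
    [[5, 6, 7], [8, 9], [10, 11, 12, 13]],
    eos_token_id=0,
    context_length=4,
)
# chunks == [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
# Note how the second chunk mixes tokens from two different samples.
```

This is exactly the situation I am asking about: the middle chunk contains the end of one sample and the start of another.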
My question is about the second strategy (concatenate with EOS and chunk): how does the model differentiate between the samples?
For example, in the case of a causal LM, a mixed chunk means the model will learn to predict the beginning of the next sample from the end of the previous one, which may or may not make sense. Similarly, in MLM, self-attention is applied across all tokens in the chunk, i.e., the model will learn attention weights from the next sample to the previous one as well as from the previous sample to the next.
So is there a special attention mask that can be used to avoid this? Or do we have some special positional embeddings that we can add to the original embeddings?
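For concreteness, here is a sketch of the kind of mask I have in mind: a block-diagonal causal mask built from a per-token sample ID, so that a token can only attend to earlier tokens from its own sample. The `sample_ids` bookkeeping array is my own assumption; I don't know whether the course's data pipeline keeps track of anything like it. (In practice this would be a boolean tensor, but I use plain lists here for clarity.)

```python
def document_causal_mask(sample_ids):
    # sample_ids[i] says which original sample token i came from.
    # mask[i][j] is True when token i may attend to token j:
    # j must not be in the future (causal constraint) and
    # must belong to the same sample (block-diagonal constraint).
    n = len(sample_ids)
    return [
        [j <= i and sample_ids[i] == sample_ids[j] for j in range(n)]
        for i in range(n)
    ]

# Two samples packed into one chunk: tokens 0-2 from sample 0, tokens 3-4 from sample 1.
mask = document_causal_mask([0, 0, 0, 1, 1])
# mask[1][0] is True  (within-sample, past token)
# mask[3][2] is False (cross-sample attention blocked)
# mask[0][1] is False (future token blocked)
```

Is something like this used in practice, or is cross-sample attention simply tolerated?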
I am confused because I believe that, for a large enough corpus, this will occur often (enough for the model to learn the wrong thing).
Or is it that, since LMs are trained on entire corpora of data (like books or entire web pages on the internet), adjacent sequences will be continuous with high probability, so it makes sense to keep them in the same context? (Consequently, the case where, say, the final paragraph of one book is concatenated with the first paragraph of a different book would seldom happen and hence wouldn't affect the weights much.)
Is this the reason?