As far as I know, when processing raw text into chunks with a fixed context window for training a GPT model, the context windows do not overlap with each other. For example:
raw text: A B C D E F G H
context windows (non-overlapping): [A B C D] [E F G H]
Is there any reason not to use overlapping context windows instead, like:
context windows (overlapping): [A B C D] [C D E F] [E F G H]
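For concreteness, the two chunking strategies above can be sketched as a sliding window with a configurable stride (the `chunk` helper and its parameter names are just illustrative, not from any particular library):

```python
def chunk(tokens, window, stride):
    # Slide a window of size `window` over the token sequence,
    # advancing by `stride` tokens each step.
    # stride == window -> non-overlapping chunks
    # stride <  window -> overlapping chunks
    return [tokens[i:i + window]
            for i in range(0, len(tokens) - window + 1, stride)]

tokens = list("ABCDEFGH")

print(chunk(tokens, window=4, stride=4))
# [['A', 'B', 'C', 'D'], ['E', 'F', 'G', 'H']]

print(chunk(tokens, window=4, stride=2))
# [['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F'], ['E', 'F', 'G', 'H']]
```

Note that with `stride < window` the same tokens are seen multiple times per epoch, so the effective dataset size (and training cost) grows by a factor of roughly `window / stride`.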
The intuition is that overlapping context windows could help the model learn contiguous information that would otherwise be lost to truncation at chunk boundaries. Is there any research that supports this idea?