Question about GPT's data preprocessing for training

p208p2002 · July 4, 2023, 1:01am

As far as I know, when raw text is split into fixed-size context windows for training a GPT model, the windows do not overlap. For example:

raw text:
A B C D E F G H

context windows (non-overlapping):
[A B C D] [E F G H]

Is there any reason not to use overlapping context windows, like this:

context windows (overlapping):
[A B C D] [C D E F] [E F G H]

The intuitive idea is that overlapping context windows could help the model learn more contiguous information, since fewer dependencies would be cut off by truncation at the window boundaries. Is there any research that supports this notion?
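
To make the two schemes concrete, here is a minimal sketch in Python. The helper name `chunk` and its `window` and `stride` parameters are hypothetical, chosen only for illustration and not taken from any library. It assumes the text is already tokenized and that a trailing partial window is simply dropped.

```python
def chunk(tokens, window, stride):
    """Split tokens into fixed-size windows, advancing by `stride` each step.

    Hypothetical helper for illustration only.
    Only full windows are kept; a trailing partial window is dropped.
    """
    last_start = len(tokens) - window
    return [tokens[i:i + window] for i in range(0, last_start + 1, stride)]


tokens = "A B C D E F G H".split()

# stride == window: windows are disjoint
non_overlap = chunk(tokens, window=4, stride=4)

# stride < window: consecutive windows share (window - stride) tokens
overlap = chunk(tokens, window=4, stride=2)

print(non_overlap)
print(overlap)
```

With `stride=4` this gives the two disjoint windows above, and with `stride=2` it gives the three overlapping ones. Note that the overlapping version covers 12 token positions instead of 8, so most tokens are seen more than once per pass over the data.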
