Hi everyone. I was wondering, what’s the best way to approach a long text for GPT-2 training?
Suppose the text is 10000 tokens long and the tokenizer’s max length is 2048.
Is it better to split the corpus into 5 sequential samples (first sample from token 0 to 2047, second sample from token 2048 to 4095, …) or into 7953 overlapping samples with a stride of 1 (first sample from token 0 to token 2047, second sample from token 1 to 2048, …, last sample from token 7952 to token 9999)?
I’ve seen people using both approaches.
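To make the two options concrete, here is a minimal sketch of both chunking strategies (function names are just for illustration, not from any library):

```python
def chunk_sequential(tokens, max_len=2048):
    # Non-overlapping chunks: [0:2048], [2048:4096], ...
    # The last chunk may be shorter than max_len.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def chunk_sliding(tokens, max_len=2048, stride=1):
    # Overlapping windows: [0:2048], [1:2049], ...
    # Number of windows = len(tokens) - max_len + 1 when stride is 1.
    return [tokens[i:i + max_len] for i in range(0, len(tokens) - max_len + 1, stride)]

tokens = list(range(10000))
print(len(chunk_sequential(tokens)))  # 5 (last chunk has 10000 - 4*2048 = 1808 tokens)
print(len(chunk_sliding(tokens)))     # 7953
```

Note that with a stride of 1 each token (except those near the edges) appears in ~2048 samples, so the dataset blows up enormously; in practice people often use a larger stride (e.g. half the window) as a middle ground.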