Hi everyone. I was wondering, what’s the best way to approach a long text for GPT-2 training?
Suppose the text is 10000 tokens long and the tokenizer’s max length is 2048.
Is it better to split the corpus into 5 sequential samples (first sample from token 0 to 2047, second sample from token 2048 to 4095, …) or into 7953 overlapping samples with a stride of 1 (first sample from token 0 to token 2047, second sample from token 1 to 2048, …, last sample from token 7952 to token 9999)?
I’ve seen people using both approaches.
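To make the two options concrete, here is a minimal sketch of both chunking strategies (function names are just for illustration, not from any library):

```python
def chunk_sequential(tokens, max_len=2048):
    # Non-overlapping chunks: [0:2048], [2048:4096], ...
    # The last chunk may be shorter than max_len.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def chunk_sliding(tokens, max_len=2048, stride=1):
    # Overlapping windows: [0:2048], [1:2049], ...
    # Number of windows = len(tokens) - max_len + 1 when stride is 1.
    return [tokens[i:i + max_len] for i in range(0, len(tokens) - max_len + 1, stride)]

tokens = list(range(10000))
print(len(chunk_sequential(tokens)))  # 5 (last chunk has 10000 - 4*2048 = 1808 tokens)
print(len(chunk_sliding(tokens)))     # 7953
```

Note that with a stride of 1 each token (except those near the edges) appears in ~2048 samples, so the dataset blows up enormously; in practice people often use a larger stride (e.g. half the window) as a middle ground.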