About training data pre-processing


I have a dataset where each example has a different length, so after tokenization the tokenized sequences also have different lengths. When I feed the tokenized data into the GPT-2 model for training, an error occurs.

Do all the examples need to be the same length before I can train the GPT-2 model?
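For context, one common way to handle this is to pad every sequence in a batch to the length of the longest one and supply an attention mask so the model ignores the padding. Below is a minimal sketch of that idea in plain Python; `pad_batch` is a hypothetical helper, and the token ids are made up. Reusing GPT-2's EOS token id (50256) as the pad id is a common convention, since GPT-2 has no dedicated pad token.

```python
# Sketch: pad variable-length token-id sequences to a common length
# so they can be stacked into one batch tensor for training.
# PAD_ID reuses GPT-2's eos_token_id (50256) as a pad id; padding
# positions are masked out via the attention mask.

PAD_ID = 50256  # GPT-2 eos_token_id, reused here as a pad id

def pad_batch(batch):
    """Pad each sequence to the longest length in the batch.
    Returns padded ids and an attention mask (1 = real token, 0 = pad)."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        pad_len = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * pad_len)
        attention_mask.append([1] * len(seq) + [0] * pad_len)
    return input_ids, attention_mask

# Example: three tokenized samples of different lengths (ids are illustrative)
batch = [[15496, 995], [40, 1842, 34242, 17], [9288]]
ids, mask = pad_batch(batch)
```

After padding, every row has the same length, so the batch can be converted to a tensor. In practice, a library data collator (for example, one that pads dynamically per batch) does this for you.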