Processing a Large Dataset for Training a GPT-2 Model

I am working with a very large data source (230M documents) and am trying to train a GPT-2-style model using the run_clm.py script with DeepSpeed. There is a grouping function in the run_clm.py script (transformers/run_clm.py at main · huggingface/transformers · GitHub) which concatenates the tokenized data and breaks it into chunks of the maximum sequence length.
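
For reference, my understanding of that grouping step is roughly the sketch below (paraphrased, not the exact script code); the toy dataset, block_size, and num_proc values are just placeholders for my real 230M-document setup:

```python
from itertools import chain
from datasets import Dataset
from transformers import AutoTokenizer

# Placeholder values; in my real run these come from the run_clm.py arguments.
block_size = 1024   # stands in for the max sequence length
num_proc = 8        # stands in for --preprocessing_num_workers

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tiny toy dataset just to make the sketch runnable; my real corpus has ~230M documents.
raw = Dataset.from_dict({"text": ["some document text"] * 100})
tokenized = raw.map(lambda ex: tokenizer(ex["text"]), batched=True, remove_columns=["text"])

def group_texts(examples):
    # Concatenate all tokenized documents in the batch into one long token stream.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so every block is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    # Split the stream into fixed-size blocks.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# This batched map over the whole corpus is the step that is slow for me.
lm_dataset = tokenized.map(group_texts, batched=True, num_proc=num_proc)
```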

Since my dataset is so large, the estimated total time shown for this step is around 10 days, which is far too long for pre-processing. Is there a way I can speed up the process?