Processing a Large Dataset for Training a GPT-2 Model

I am working with a very large data source (230M documents) and am trying to train a GPT-2 style model using the run_clm.py script with DeepSpeed. There is a grouping function in the run_clm.py script (transformers/run_clm.py at main · huggingface/transformers · GitHub) which concatenates the tokenized documents and splits them into chunks of max_seq_length tokens.
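For reference, the grouping function looks roughly like this (paraphrased from the script; block_size here corresponds to the max_seq_length I mentioned):

def group_texts(examples):
    # Concatenate all tokenized texts into one long Python list per column.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result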

Since my data is so big, the estimated total time shown for this step is around 10 days, which is far too long for preprocessing alone. Is there a way I can speed up the process?

Hi! Have you tried increasing preprocessing_num_workers?
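That flag ends up as the num_proc argument of the .map() calls inside the script, roughly like this (exact keyword arguments may differ between versions):

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,  # set via --preprocessing_num_workers
    remove_columns=column_names,
)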

Yeah, I have already set it to the maximum number of processors my system has. Is there a way to replace this function with something faster?

This function uses Python lists, which are known to be slow. Using NumPy is significantly faster since it is well integrated with Apache Arrow, the format used to store the datasets: Arrow columns can be read as NumPy arrays without costly conversions.

Therefore you should manipulate NumPy arrays instead. To do that, use the “numpy” formatting:

tokenized_datasets = tokenized_datasets.with_format("numpy")
lm_datasets = tokenized_datasets.map(...)

And you can modify your map function to work with NumPy arrays only, instead of lists.
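For example, here is a minimal sketch of the grouping step rewritten with NumPy, applied after the with_format("numpy") call above (block_size, the function name, and num_proc are just illustrative; adapt them to your script):

import numpy as np

block_size = 1024  # use your max_seq_length

def group_texts_numpy(examples):
    # With the "numpy" format, each column arrives as a sequence of NumPy arrays,
    # so concatenation and chunking stay vectorized instead of going through Python lists.
    concatenated = {k: np.concatenate(list(examples[k])) for k in examples.keys()}
    # Drop the remainder and reshape into (num_chunks, block_size).
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {k: t[:total_length].reshape(-1, block_size) for k, t in concatenated.items()}
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts_numpy,
    batched=True,
    num_proc=48,  # your preprocessing_num_workers
)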

(It can even be slightly faster with the "arrow" formatting, but that would require rewriting your map function completely, so I wouldn't suggest it.)

I was noticing this "issue" as well when trying to map the Pile. The estimated time would have been ~10 days to tokenize on 48 processes. I will try with NumPy.