Processing a Large Dataset for Training a GPT-2 Model

I am working with a very large data source (230M documents) and am trying to train a GPT-2 style model using the run_clm.py script with DeepSpeed. There is a grouping function in the run_clm.py script (transformers/run_clm.py at main · huggingface/transformers · GitHub) which concatenates the tokenized documents and splits them into chunks of max_seq_length tokens.
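For reference, the grouping function looks roughly like this (paraphrased from the script; block_size here corresponds to the max_seq_length I mentioned):

def group_texts(examples):
    # Concatenate all tokenized texts into one long Python list per column.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result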

Since my data is so big, the estimated total time shown for this step is around 10 days, which is far too long for preprocessing alone. Is there a way I can speed up the process?

Hi! Have you tried increasing preprocessing_num_workers?
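That flag ends up as the num_proc argument of the .map() calls inside the script, roughly like this (exact keyword arguments may differ between versions):

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,  # set via --preprocessing_num_workers
    remove_columns=column_names,
)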

Yeah, I have already set it to the maximum number of processors my system has. Is there a way to replace this function with something faster?

This function uses Python lists, which are known to be slow. Using NumPy is significantly faster since it is well integrated with Apache Arrow, the format used to store the datasets: Arrow columns can be read as NumPy arrays without costly conversions.

Therefore you should manipulate NumPy arrays instead. To do that, use the “numpy” formatting:

tokenized_datasets = tokenized_datasets.with_format("numpy")
lm_datasets = tokenized_datasets.map(...)

And you can modify your map function to work with NumPy arrays only, instead of lists.
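For example, here is a minimal sketch of the grouping step rewritten with NumPy, applied after the with_format("numpy") call above (block_size, the function name, and num_proc are just illustrative; adapt them to your script):

import numpy as np

block_size = 1024  # use your max_seq_length

def group_texts_numpy(examples):
    # With the "numpy" format, each column arrives as a sequence of NumPy arrays,
    # so concatenation and chunking stay vectorized instead of going through Python lists.
    concatenated = {k: np.concatenate(list(examples[k])) for k in examples.keys()}
    # Drop the remainder and reshape into (num_chunks, block_size).
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {k: t[:total_length].reshape(-1, block_size) for k, t in concatenated.items()}
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts_numpy,
    batched=True,
    num_proc=48,  # your preprocessing_num_workers
)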

(It can even be slightly faster with the "arrow" formatting, but that would require rewriting your map function completely, so I wouldn't suggest it.)

I was noticing this "issue" as well when trying to map the Pile. The estimated time would have been ~10 days to tokenize on 48 processes. I will try with NumPy.