Original BERT Pre-training


I would like to repeat the original BERT pre-training with my own data. I was trying to use the example script from the Hugging Face library (transformers/run_mlm_no_trainer.py at master · huggingface/transformers · GitHub).
The problem is that the data are fed to the model differently: in the original pre-training, the inputs were not real sentences but chunks of text of length “max_seq_length”. This behaviour is enabled by setting the parameter “line_by_line” to false, and here is where my problem arises.
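For reference, the grouping done when “line_by_line” is false looks roughly like this (a simplified sketch of the “group_texts” helper from the run_mlm scripts; the toy batch and max_seq_length value are just placeholders):

```python
def group_texts(examples, max_seq_length=8):
    """Concatenate all token lists in the batch, then re-split them
    into fixed-size chunks of max_seq_length ids."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the last partial chunk, as the original script does.
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

# Toy example: two "tokenized sentences" regrouped into chunks of 8 ids.
batch = {"input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12]]}
print(group_texts(batch)["input_ids"])  # → [[1, 2, 3, 4, 5, 6, 7, 8]]
```

Note that this only works on already-tokenized examples, which is exactly the conflict described below.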
Since I have a huge dataset and train on a GPU cluster with a limit on execution time (24h, after which I can resume from a training checkpoint), I must tokenize the sentences on the fly (as explained in 1.3GB dataset creates over 107GB of cache file! · Issue #10204 · huggingface/transformers · GitHub), so I cannot apply the “group_texts” function in advance because it requires the data to be tokenized first.
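One direction I have considered (just a sketch, not tested at scale) is doing the tokenization and the grouping in the same on-the-fly step: a buffered generator that accumulates token ids across sentences and yields fixed-length chunks, so nothing is ever cached to disk. The whitespace tokenizer below is only a stand-in for a real one:

```python
def chunked_token_stream(sentences, tokenize, max_seq_length):
    """Tokenize sentences lazily and yield fixed-length chunks of ids,
    carrying leftover tokens over into the next chunk (no caching)."""
    buffer = []
    for sent in sentences:
        buffer.extend(tokenize(sent))
        while len(buffer) >= max_seq_length:
            yield buffer[:max_seq_length]
            buffer = buffer[max_seq_length:]
    # The final partial chunk is dropped, matching group_texts behaviour.

# Stand-in tokenizer: maps each whitespace-separated word to a fake id.
fake_tokenize = lambda s: [hash(w) % 1000 for w in s.split()]

chunks = list(chunked_token_stream(
    ["one two three", "four five six seven", "eight"],
    fake_tokenize, max_seq_length=4))
print(len(chunks))  # 8 tokens total → two full chunks of 4 ids
```

This only sketches the data side; whether it can be wired cleanly into the run_mlm_no_trainer.py dataloader is exactly what I am unsure about.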

How can I solve this problem? Are you aware of a pre-training script that follows the original pre-training procedure and can be used with a huge dataset under a limited execution time?

Thanks in advance!

EDIT: tokenizing the whole dataset in advance and saving it is not an option, because I need to keep all the tokenization information for other experiments and, as reported in the issue linked above, tokenization creates GBs (in my case it would be TBs) of files!