Training a BERT model from scratch on custom sequences


I have a whitespace-separated text file containing sequences of strings (they are not words; the data comes from a different domain).
I want to try pre-training a BERT model on this data.

I have seen tutorials where people fine-tune on different target datasets, but I don't think there is an official tutorial for pre-training on custom data. I think I need to index (tokenize) the data sequences first and then run the pre-training.
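For the indexing step I described, here is a rough sketch of what I have in mind: since the tokens are whitespace-separated and aren't natural-language words, I'd build a word-level vocabulary that reserves BERT's special tokens first, then map each sequence to ids. (The sample data and the vocabulary-size cap are just placeholders.)

```python
# Hypothetical sketch: build a token-to-id vocabulary ("index") from
# whitespace-separated sequences, reserving BERT's special tokens first.
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(lines, max_size=30000):
    # Count every whitespace-separated token across all sequences.
    counts = Counter(tok for line in lines for tok in line.split())
    # Special tokens get the lowest ids, as in BERT's own vocab files.
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok, _ in counts.most_common(max_size - len(SPECIAL_TOKENS)):
        vocab[tok] = len(vocab)
    return vocab

def encode(line, vocab):
    # Wrap each sequence in [CLS] ... [SEP]; unseen tokens map to [UNK].
    unk = vocab["[UNK]"]
    return [vocab["[CLS]"]] + [vocab.get(t, unk) for t in line.split()] + [vocab["[SEP]"]]

lines = ["A1 B2 C3", "B2 D4"]  # placeholder sequences
vocab = build_vocab(lines)
print(encode("A1 X9", vocab))  # X9 was never seen, so it maps to [UNK]
```

My assumption is that once sequences are encoded like this, the masked-language-modeling objective itself doesn't care that the tokens aren't words, so the rest of the pipeline should work unchanged.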

Are there any resources that could help me with this? Any pointers would be much appreciated.