Train a large transformer with Custom Tokenizer/Data

I’d like to train a large transformer model with my own data and tokenizer. I have billions of examples, so I’d like to avoid anything pre-trained; I just want to train from scratch. Ideally I’d like to use tf/keras (I’m hoping to upload the data to Google Cloud and train with TPUs there, since I already have TPUs).

What’s the easiest way to do this?

I can think of a couple of options:

  1. I dump a dataset where each document is just a sequence of integers, and the model doesn’t care about tokenization at all; it just learns to predict the next integer given a list of integers. Is there any TF code where I can do this easily? (I expect I’ll have a few thousand documents, each with roughly a million integers/tokens.) See the first sketch below.

  2. I implement my own tokenizer, and the model (or its input pipeline) lets me plug that tokenizer in. Is there any TF code where I can do this easily? See the second sketch below.
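
For option 1, here’s a minimal sketch of how this could look in TF/Keras: a tf.data pipeline that windows pre-tokenized integer sequences into (input, target) pairs shifted by one token, plus a small decoder-only transformer trained with sparse categorical cross-entropy. All sizes (`VOCAB_SIZE`, `SEQ_LEN`, width/depth) and the `TinyGPT`/`make_dataset` names are placeholders I made up; `use_causal_mask` on `MultiHeadAttention` needs TF >= 2.10.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 32_000   # assumption: size of your integer vocabulary
SEQ_LEN = 512         # assumption: training context length
D_MODEL = 512
NUM_HEADS = 8
NUM_LAYERS = 6

def make_dataset(token_docs, batch_size=32):
    """token_docs: iterable of 1-D int sequences (one per document).
    Concatenates them, chops into SEQ_LEN+1 windows, and yields
    (inputs, targets) pairs shifted by one token."""
    flat = tf.concat([tf.convert_to_tensor(d, dtype=tf.int32) for d in token_docs], axis=0)
    ds = tf.data.Dataset.from_tensor_slices(flat)
    ds = ds.batch(SEQ_LEN + 1, drop_remainder=True)
    ds = ds.map(lambda w: (w[:-1], w[1:]), num_parallel_calls=tf.data.AUTOTUNE)
    return ds.shuffle(10_000).batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = keras.Sequential(
            [layers.Dense(4 * d_model, activation="gelu"), layers.Dense(d_model)])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()

    def call(self, x):
        h = self.norm1(x)
        # use_causal_mask requires TF >= 2.10; on older versions build the
        # lower-triangular mask yourself with tf.linalg.band_part.
        x = x + self.attn(h, h, use_causal_mask=True)
        return x + self.ffn(self.norm2(x))

class TinyGPT(keras.Model):
    """Minimal decoder-only model: next-integer prediction over raw ids."""
    def __init__(self):
        super().__init__()
        self.tok_emb = layers.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = layers.Embedding(SEQ_LEN, D_MODEL)
        self.blocks = [TransformerBlock(D_MODEL, NUM_HEADS) for _ in range(NUM_LAYERS)]
        self.norm = layers.LayerNormalization()
        self.head = layers.Dense(VOCAB_SIZE)  # logits over the integer vocabulary

    def call(self, ids):
        positions = tf.range(tf.shape(ids)[1])
        h = self.tok_emb(ids) + self.pos_emb(positions)
        for block in self.blocks:
            h = block(h)
        return self.head(self.norm(h))

# On a Cloud TPU you would build/compile inside a TPUStrategy scope, e.g.:
#   resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
#   tf.config.experimental_connect_to_cluster(resolver)
#   tf.tpu.experimental.initialize_tpu_system(resolver)
#   strategy = tf.distribute.TPUStrategy(resolver)
#   with strategy.scope(): ...
model = TinyGPT()
model.compile(optimizer=keras.optimizers.Adam(3e-4),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(make_dataset(my_token_docs), epochs=1)   # my_token_docs: your int sequences
```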
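
For option 2, note that the Keras model itself only ever sees integer ids; the tokenizer normally lives in the tf.data pipeline (or runs offline). Here’s a rough sketch, assuming a purely hypothetical Python tokenizer (`MyTokenizer` with an `encode` method), wrapped with `tf.py_function` so the same windowing pipeline as above can consume raw text. Since `tf.py_function` keeps the tokenizer in Python, it is slow and doesn’t fit every TPU input setup, so in practice many people tokenize offline and write TFRecords of integer ids instead.

```python
import tensorflow as tf

class MyTokenizer:
    """Hypothetical custom tokenizer: replace encode() with your real logic
    (BPE, unigram, domain-specific rules, ...)."""
    def __init__(self, vocab):
        self.vocab = vocab
    def encode(self, text):
        return [self.vocab.get(tok, 0) for tok in text.split()]

tokenizer = MyTokenizer(vocab={"hello": 1, "world": 2})  # toy vocab for illustration

def _encode_py(text_tensor):
    # Runs eagerly inside tf.py_function: bytes -> list of ids -> int32 tensor.
    ids = tokenizer.encode(text_tensor.numpy().decode("utf-8"))
    return tf.constant(ids, dtype=tf.int32)

def encode(text_tensor):
    ids = tf.py_function(_encode_py, inp=[text_tensor], Tout=tf.int32)
    ids.set_shape([None])  # variable-length 1-D sequence of token ids
    return ids

raw_docs = tf.data.Dataset.from_tensor_slices(["hello world", "world hello hello"])
token_docs = raw_docs.map(encode, num_parallel_calls=tf.data.AUTOTUNE)
# From here the integer sequences can be windowed and batched the same way
# as in the make_dataset() sketch for option 1.
```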