Tokenize a large corpus

Hi guys,

How can I efficiently tokenize a corpus of 340M sentences for pre-training models like BERT or GPT-2? Following the standard documentation, the process would take around 20 hours to finish.
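
For reference, this is roughly the pipeline I'm running, following the `datasets` docs (the file path and checkpoint name below are just placeholders for my actual setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder path: a plain text file with one sentence per line.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

# Placeholder checkpoint; I use the tokenizer matching the model I pre-train.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Tokenize the "text" column in batches.
    return tokenizer(examples["text"], truncation=True, max_length=128)

# This map call is the step that takes ~20 hours on my machine.
tokenized = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)
```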

Thanks.