Hi guys,
How can I efficiently tokenize a corpus of 340M sentences for pre-training models like BERT or GPT-2? Following the standard documentation, the process would take around 20 hours to finish.
Thanks.
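For context, the kind of speedup I'm after is what you get from batching and parallelizing the work instead of tokenizing one sentence at a time. Below is a minimal sketch of that pattern; the whitespace `split()` is only a stand-in for a real subword tokenizer (e.g. a Hugging Face fast tokenizer), and the batch size and worker count are placeholder values:

```python
from multiprocessing import Pool

def tokenize_batch(batch):
    # Stand-in tokenizer: in practice, replace str.split with a fast
    # subword tokenizer's batch-encoding call.
    return [sentence.split() for sentence in batch]

def chunked(sentences, size):
    # Yield consecutive batches of `size` sentences.
    for i in range(0, len(sentences), size):
        yield sentences[i:i + size]

def tokenize_corpus(sentences, batch_size=10_000, workers=4):
    # Fan batches out across worker processes, then flatten the results
    # back into one list of token sequences, preserving order.
    with Pool(workers) as pool:
        batched = pool.map(tokenize_batch, chunked(sentences, batch_size))
    return [tokens for batch in batched for tokens in batch]
```

With a real fast (Rust-backed) tokenizer doing the batch encoding inside each worker, this layout is what typically brings a multi-hour single-threaded run down to a fraction of the time.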