Hi guys,
How can I efficiently tokenize a corpus of 340M sentences for pre-training models like BERT or GPT-2? Following the standard documentation, the process would take around 20 hours to finish.
Thanks.
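For context, the kind of speedup I'm after is what you get from batching and parallelizing the work instead of tokenizing one sentence at a time. Below is a minimal sketch of that pattern; the whitespace `split()` is only a stand-in for a real subword tokenizer (e.g. a Hugging Face fast tokenizer), and the batch size and worker count are placeholder values:

```python
from multiprocessing import Pool

def tokenize_batch(batch):
    # Stand-in tokenizer: in practice, replace str.split with a fast
    # subword tokenizer's batch-encoding call.
    return [sentence.split() for sentence in batch]

def chunked(sentences, size):
    # Yield consecutive batches of `size` sentences.
    for i in range(0, len(sentences), size):
        yield sentences[i:i + size]

def tokenize_corpus(sentences, batch_size=10_000, workers=4):
    # Fan batches out across worker processes, then flatten the results
    # back into one list of token sequences, preserving order.
    with Pool(workers) as pool:
        batched = pool.map(tokenize_batch, chunked(sentences, batch_size))
    return [tokens for batch in batched for tokens in batch]
```

With a real fast (Rust-backed) tokenizer doing the batch encoding inside each worker, this layout is what typically brings a multi-hour single-threaded run down to a fraction of the time.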