I found that the tokenization step before training takes longer than the training itself.
Yes, training runs on the GPU, but I assumed tokenization is not that compute-intensive (just splitting sentences into tokens, mapping tokens to IDs, and a few other substeps…), so I expected it to be bounded by the I/O time for loading the raw dataset. Instead, tokenizing a subset of Wikipedia took more than 2 hours.
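To separate disk I/O from the tokenizer itself, I assume a quick timing check on texts that are already in memory would show whether the tokenizer alone is the bottleneck; something like this sketch (the texts list is just a placeholder for my actual data):

import time
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Placeholder texts standing in for the Wikipedia subset; everything is already
# in memory, so no disk I/O happens inside the timed region.
texts = ["some wikipedia sentence about a topic"] * 10_000

start = time.perf_counter()
tokenizer(texts, truncation=True, padding='max_length', max_length=64)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.0f} examples/sec")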
Can someone explain why the tokenization step takes so long?
The code I used for tokenization is below. I also tried multiprocessing (see the sketch after the snippet), but it made no meaningful difference.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_example(example):
    # With batched=True, example['text'] is a list of strings
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=64)  # Reduce max_length to save memory

# Tokenize dataset (dataset is the Wikipedia subset loaded earlier)
dataset = dataset.map(encode_example, batched=True)
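In case it matters, by "multiprocessing" I mean something along the lines of the num_proc argument of datasets.map, reusing encode_example and dataset from the snippet above (the worker count here is arbitrary):

# Multiprocessing variant: datasets.map spawns worker processes, but each worker
# still runs the same tokenizer on its own shard of the data.
dataset = dataset.map(
    encode_example,
    batched=True,
    num_proc=4,  # arbitrary worker count; different values gave similar results
)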