Why does tokenization take so long?

I found that the tokenization step before training takes longer than the training itself.

Yes, training runs on the GPU, but I thought tokenization was not that compute-intensive (just splitting sentences into tokens, mapping tokens to IDs, and a few other substeps…), so I expected it to be bounded by the I/O time for loading the raw dataset. Yet tokenizing a subset of Wikipedia takes more than 2 hours.
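
To be concrete, the comparison I have in mind is roughly the sketch below (the Wikipedia config and split here are placeholders for the subset I'm actually using):

    import time
    from datasets import load_dataset
    from transformers import BertTokenizer

    t0 = time.time()
    # Placeholder config/split for the Wikipedia subset
    dataset = load_dataset('wikipedia', '20220301.en', split='train[:1%]')
    print(f'load: {time.time() - t0:.1f}s')

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    t0 = time.time()
    dataset.map(lambda ex: tokenizer(ex['text'], truncation=True, padding='max_length', max_length=64), batched=True)
    print(f'tokenize: {time.time() - t0:.1f}s')

The tokenization pass dominates by a wide margin, which is what surprised me.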

Can someone explain why the tokenization step takes so long?

The code I used for tokenization is below. I also tried multiprocessing (shown after the code), but it didn't make a meaningful difference.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def encode_example(example):
        # With batched=True, example['text'] is a list of texts
        return tokenizer(example['text'], truncation=True, padding='max_length', max_length=64)  # Reduce max_length to save memory

    # Tokenize the dataset (the raw Wikipedia subset loaded earlier)
    dataset = dataset.map(encode_example, batched=True)
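
The multiprocessing attempt was roughly the following (a sketch; the exact `num_proc` value I used may have differed):

    # Same tokenization, but split across worker processes with num_proc
    dataset = dataset.map(encode_example, batched=True, num_proc=4)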