I found that the tokenization step before training takes longer than the training itself.
Yes, training runs on the GPU, but I assumed tokenization is not that compute-intensive (just splitting sentences into tokens, mapping tokens to IDs, and a few other substeps…), so I expected it to be bounded by the I/O time for loading the raw dataset. Instead, tokenizing a subset of Wikipedia took more than 2 hours.
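To separate disk I/O from the tokenizer itself, I assume a quick timing check on texts that are already in memory would show whether the tokenizer alone is the bottleneck; something like this sketch (the texts list is just a placeholder for my actual data):

import time
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Placeholder texts standing in for the Wikipedia subset; everything is already
# in memory, so no disk I/O happens inside the timed region.
texts = ["some wikipedia sentence about a topic"] * 10_000

start = time.perf_counter()
tokenizer(texts, truncation=True, padding='max_length', max_length=64)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.0f} examples/sec")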
Can someone explain why the tokenization step takes so long?
The code I used for tokenization is below. I also tried multiprocessing (see the sketch after the snippet), but it made no meaningful difference.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_example(example):
    # With batched=True, example['text'] is a list of strings
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=64)  # Reduce max_length to save memory

# Tokenize dataset (dataset is the Wikipedia subset loaded earlier)
dataset = dataset.map(encode_example, batched=True)
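In case it matters, by "multiprocessing" I mean something along the lines of the num_proc argument of datasets.map, reusing encode_example and dataset from the snippet above (the worker count here is arbitrary):

# Multiprocessing variant: datasets.map spawns worker processes, but each worker
# still runs the same tokenizer on its own shard of the data.
dataset = dataset.map(
    encode_example,
    batched=True,
    num_proc=4,  # arbitrary worker count; different values gave similar results
)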