I have a similar issue: tokenizing a large corpus of text with a pretrained WordPiece tokenizer takes several hours. I'm doing:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
train_tokenized_encodings = tokenizer(df[df.split == 'train'].text.tolist(), truncation=True, padding=True, max_length=MAX_LENGTH)
Any suggestions for speeding this up?
Is there a way to parallelize this? (Or does the above automatically use multiple workers?)
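The only alternative I've thought of so far is moving the texts into a datasets.Dataset and running the tokenizer through map with num_proc. Here's a rough, untested sketch reusing the same df, tokenizer, and MAX_LENGTH as above; is something like this the recommended approach, or is there a better way?

from datasets import Dataset

# keep only the text column from the train split
train_dataset = Dataset.from_pandas(df[df.split == 'train'][['text']])

def tokenize_batch(batch):
    # note: padding=True here pads per batch, not across the whole corpus
    return tokenizer(batch['text'], truncation=True, padding=True, max_length=MAX_LENGTH)

# num_proc=4 is an arbitrary choice on my part
train_tokenized = train_dataset.map(tokenize_batch, batched=True, num_proc=4)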