Fastest way to tokenize millions of examples?

Currently I have a pandas column of strings, and I tokenize it by defining a function that performs the tokenization and applying it to the text column with pandas map.

It's a slow process when I have millions of rows of text, and I'm wondering whether there's a faster way to tokenize all my training examples.
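
For reference, here is a minimal sketch of that row-by-row setup, assuming a Hugging Face tokenizer; the model name, column name, and stand-in DataFrame are placeholders, not taken from the original post:

import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Stand-in DataFrame; in practice this would be the real column of texts.
df = pd.DataFrame({"text": ["example sentence one", "example sentence two"]})

def tokenize_text(text):
    # One tokenizer call per row; this per-call overhead is what becomes
    # slow across millions of rows.
    return tokenizer(text, truncation=True)

df["tokens"] = df["text"].map(tokenize_text)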


bump, still looking for a solution

I have a similar issue: using a pretrained WordPiece tokenizer on a large corpus of text takes several hours. I'm doing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_tokenized_encodings = tokenizer(df[df.split == "train"].text.tolist(), truncation=True, padding=True, max_length=MAX_LENGTH)

Any suggestions for speeding this up?

Two options come to my mind:

  • parallelization of the process itself
  • detect duplicate or near-duplicate rows, so that previous tokenizations can be re-used

These are suggestions I have no practical experience with myself, but maybe they help; a rough sketch of both ideas follows below.
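
A rough sketch of the first option only, not a definitive implementation: split the texts into chunks and tokenize each chunk in a separate process. The chunk count, MAX_LENGTH value, and stand-in DataFrame are assumptions, not taken from the thread; the last line shows the simplest form of the second suggestion, deduplicating identical rows before tokenizing.

import pandas as pd
from multiprocessing import Pool
from transformers import AutoTokenizer

MAX_LENGTH = 128  # assumed value, not from the original posts
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_chunk(texts):
    # Each worker tokenizes one chunk of texts; note that padding=True
    # pads within each chunk, not globally across all chunks.
    return tokenizer(texts, truncation=True, padding=True, max_length=MAX_LENGTH)

if __name__ == "__main__":
    # Stand-in DataFrame; in practice this would be the poster's df with text/split columns.
    df = pd.DataFrame({"text": ["example sentence"] * 1000, "split": ["train"] * 1000})
    texts = df[df.split == "train"].text.tolist()

    n_workers = 8  # e.g. one chunk per CPU core
    chunks = [texts[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        encodings = pool.map(tokenize_chunk, chunks)

    # Simplest form of the second suggestion: drop exact duplicates before tokenizing.
    unique_texts = list(dict.fromkeys(texts))

Alternatively, converting the DataFrame to a datasets.Dataset and calling .map(..., batched=True, num_proc=...) gives similar parallelism without managing the pool by hand.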


Can you share code, please?