Currently I am tokenizing a pandas column of strings by defining a function that performs the tokenization and applying it to the text column with pandas map.
It's a slow process when I have millions of rows of text, and I am wondering if there's a faster way to tokenize all my training examples.
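Roughly, my setup looks like this (the DataFrame contents and the tokenization function are just illustrative stand-ins):

import pandas as pd

def tokenize_text(text):
    # stand-in for the actual tokenization operation, applied one row at a time
    return text.split()

df = pd.DataFrame({"text": ["first example sentence", "second example sentence"]})
df["tokens"] = df["text"].map(tokenize_text)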
bump, still looking for a solution
I have a similar issue: running a pretrained WordPiece tokenizer on a large corpus of text takes several hours. I'm doing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_tokenized_encodings = tokenizer(df[df.split == 'train'].text.tolist(), truncation=True, padding=True, max_length=MAX_LENGTH)

Any suggestions for speeding this up?
Two options come to my mind:
- parallelize the tokenization itself, for example by processing the texts in batches across multiple processes (see the sketch below)
- detect duplicates or similarities between rows, so that previous tokenizations can be re-used
These are suggestions for which I have no practical experience myself, but maybe they help.
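A minimal sketch of the first option, assuming the Hugging Face datasets library is available and reusing the df and MAX_LENGTH from the post above; the batch_size and num_proc values are placeholders to tune:

from datasets import Dataset
from transformers import AutoTokenizer

MAX_LENGTH = 128  # placeholder; use whatever value you already use

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_batch(batch):
    # the tokenizer receives a list of strings per batch instead of one string per row
    return tokenizer(batch["text"], truncation=True, padding=True, max_length=MAX_LENGTH)

# df is the DataFrame with 'text' and 'split' columns from the post above
train_ds = Dataset.from_pandas(df[df.split == "train"][["text"]], preserve_index=False)
train_ds = train_ds.map(tokenize_batch, batched=True, batch_size=1000, num_proc=4)

For the second option, if the corpus contains many repeated strings, deduplicating first (e.g. df.drop_duplicates(subset="text")) and then mapping the encodings back to the original rows would avoid tokenizing the same text twice.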