Fastest way to tokenize millions of examples?

Currently I have a pandas column of strings, and I tokenize it by defining a function that performs the tokenization and applying it to the text column with pandas map.

It's a slow process when I have millions of rows of text, and I'm wondering whether there's a faster way to tokenize all my training examples.
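
For reference, here is a minimal sketch of that row-by-row setup, assuming a Hugging Face tokenizer; the model name, column name, and stand-in DataFrame are placeholders, not taken from the original post:

import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Stand-in DataFrame; in practice this would be the real column of texts.
df = pd.DataFrame({"text": ["example sentence one", "example sentence two"]})

def tokenize_text(text):
    # One tokenizer call per row; this per-call overhead is what becomes
    # slow across millions of rows.
    return tokenizer(text, truncation=True)

df["tokens"] = df["text"].map(tokenize_text)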


bump, still looking for a solution

I have a similar issue: using a pretrained WordPiece tokenizer on a large corpus of text takes several hours. I'm doing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_tokenized_encodings = tokenizer(df[df.split == "train"].text.tolist(), truncation=True, padding=True, max_length=MAX_LENGTH)

Any suggestions for speeding this up?

Two options come to my mind:

  • parallelization of the process itself
  • detect duplicate or near-duplicate rows, so that previous tokenizations can be re-used

These are suggestions I have no practical experience with myself, but maybe they help; a rough sketch of both ideas follows below.
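
A rough sketch of the first option only, not a definitive implementation: split the texts into chunks and tokenize each chunk in a separate process. The chunk count, MAX_LENGTH value, and stand-in DataFrame are assumptions, not taken from the thread; the last line shows the simplest form of the second suggestion, deduplicating identical rows before tokenizing.

import pandas as pd
from multiprocessing import Pool
from transformers import AutoTokenizer

MAX_LENGTH = 128  # assumed value, not from the original posts
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_chunk(texts):
    # Each worker tokenizes one chunk of texts; note that padding=True
    # pads within each chunk, not globally across all chunks.
    return tokenizer(texts, truncation=True, padding=True, max_length=MAX_LENGTH)

if __name__ == "__main__":
    # Stand-in DataFrame; in practice this would be the poster's df with text/split columns.
    df = pd.DataFrame({"text": ["example sentence"] * 1000, "split": ["train"] * 1000})
    texts = df[df.split == "train"].text.tolist()

    n_workers = 8  # e.g. one chunk per CPU core
    chunks = [texts[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        encodings = pool.map(tokenize_chunk, chunks)

    # Simplest form of the second suggestion: drop exact duplicates before tokenizing.
    unique_texts = list(dict.fromkeys(texts))

Alternatively, converting the DataFrame to a datasets.Dataset and calling .map(..., batched=True, num_proc=...) gives similar parallelism without managing the pool by hand.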


Can you share code, please?