Cache & parallelize long tokenization step

AngledLuffa · November 9, 2022, 8:51am

If I use run_mlm.py to train a new model on a large dataset, the cost of tokenization is quite high. In my particular case, I am looking to use the existing tokenizer associated with the vinai/phobert-large model. It takes about 20 minutes to tokenize 1 GB of text. I have 75 GB I am hoping to feed to the model, which means about a day to tokenize the whole thing.

Is there a faster way to tokenize the data, or is it at least possible to preprocess the tokenization and distribute the work?

Thanks!

AngledLuffa · November 11, 2022, 3:46am

To answer my own question a bit, tokenization went a lot faster when I passed

--preprocessing_num_workers 8

However, it got stuck in the “Grouping texts in chunks of 256” stage and didn’t do anything for hours:

Grouping texts in chunks of 256 #1: 100%|███████████████████████████████████████████████████████████| 67063/67063 [1:20:14<00:00, 13.93ba/s]
Grouping texts in chunks of 256 #4: 100%|███████████████████████████████████████████████████████████| 67063/67063 [1:26:42<00:00, 12.89ba/s]
Grouping texts in chunks of 256 #2: 100%|███████████████████████████████████████████████████████████| 67063/67063 [1:28:47<00:00, 12.59ba/s]
Grouping texts in chunks of 256 #2:  97%|█████████████████████████████████████████████████████████▍ | 65285/67063 [1:26:41<02:21, 12.54ba/s]
Grouping texts in chunks of 256 #2: 100%|██████████████████████████████████████████████████████████▉| 67061/67063 [1:28:47<00:00, 16.47ba/s]
Grouping texts in chunks of 256 #4: 100%|██████████████████████████████████████████████████████████▉| 67061/67063 [1:26:41<00:00, 16.50ba/s]

After a couple hours of nothing, I gave up and did ctrl-C to restart it, hoping the cached tokenization would be available.

Unfortunately, although I expected the tokenization to be cached at this point, it was not. This is despite there being 200 GB of data in the directory:

../.cache/huggingface/datasets/text/default-54a2ef5fec746c53/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08

I did a brief search for threads about run_mlm.py getting stuck at the Grouping texts in chunks phase, and all I found was a thread talking about too many worker processes, which is unfortunate considering one worker process would take 24 hours to tokenize this much stuff.
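For anyone wondering what the "Grouping texts in chunks of 256" stage is doing: as far as I understand it, the script concatenates the tokenized examples in each batch and slices the result into fixed-length blocks. Below is a simplified sketch of that idea in plain Python. The function name and the `block_size` parameter are my own choices for illustration, and the handling of a short final batch is an assumption, so treat it as a rough model rather than the script's exact code.

```python
from itertools import chain


def group_texts(examples, block_size=256):
    """Concatenate each column's token lists, then cut into block_size chunks.

    `examples` maps a column name (e.g. "input_ids") to a list of token lists,
    one per original line of text. Simplified illustration only.
    """
    # Flatten every column across all examples in the batch.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the tail that does not fill a whole block (assumed behavior).
    total_length = (total_length // block_size) * block_size
    return {
        k: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }


# Tiny example with block_size=4 so the result is easy to read.
batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
grouped = group_texts(batch, block_size=4)
print(grouped)
```

The work per batch is cheap, so a stall after the progress bars reach 100% is more likely to be in writing or merging the results than in the grouping itself, though that is a guess on my part.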

AngledLuffa · November 11, 2022, 3:48am

The failure of the caching might be explained by this, which is kind of irritating:

Parameter 'function'=<function main.<locals>.tokenize_function at 0x7f4bf8d7eef0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
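To make the consequence of that warning concrete, here is a toy model of fingerprint-based caching using only the standard library. It is not how the datasets library implements its cache; `fingerprint`, `cached_map`, and `toy_tokenize` are hypothetical names I made up for the sketch. It only shows the behavior the warning describes: a stable key lets a second run reuse the stored result, while a random key can never match an earlier run, so the work is redone even though files are sitting on disk.

```python
import hashlib
import json
import os
import tempfile
import uuid


def fingerprint(source_id, transform_name, params):
    """Deterministic key built from what goes into the transform (hypothetical)."""
    payload = json.dumps([source_id, transform_name, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cached_map(cache_dir, key, texts, transform):
    """Apply `transform` to each text, reusing a cache file if `key` matches.

    Returns (result, cache_hit).
    """
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f), True
    result = [transform(t) for t in texts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result, False


def toy_tokenize(text):
    # Stand-in for a real tokenizer: split on whitespace.
    return text.split()


texts = ["xin chào", "một hai ba"]
cache_dir = tempfile.mkdtemp()

# Stable fingerprint: the second call finds the file written by the first.
stable_key = fingerprint("corpus.txt", "toy_tokenize", {"lowercase": False})
_, hit_first = cached_map(cache_dir, stable_key, texts, toy_tokenize)
tokens, hit_second = cached_map(cache_dir, stable_key, texts, toy_tokenize)

# Random key (what the warning says happens when hashing fails): never a hit.
random_key = uuid.uuid4().hex
_, hit_random = cached_map(cache_dir, random_key, texts, toy_tokenize)

print(hit_first, hit_second, hit_random)
```

If this model is roughly right, it would explain why the cache directory filled up without anything being reused after the restart.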
