AutoTokenizer.encode with multithreading and multiprocessing

Hello everyone, I want to convert all my texts to token IDs and save them to files.
So I use the encode function to do it.
I tried it both with multiprocessing and with multithreading:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama2-tokenizer", trust_remote_code=True)

# Multithreading
with ThreadPoolExecutor(max_workers=8) as pool:
    result = list(pool.map(tokenizer, data))  # data is a list of str

# Multiprocessing
with ProcessPoolExecutor(max_workers=8) as pool:
    result = list(pool.map(tokenizer, data))

The results show that multiprocessing is much slower than multithreading: for about 1M tokens in total, the multiprocessing version takes about 34 s while the multithreading version takes about 2 s.

Isn't encoding computationally intensive? Since the low-level implementation of encode lives in the Rust-based Tokenizers package, I can't figure out what causes the result above. Can anybody offer an explanation? Thanks!


Hey

The best option would be to use the Hugging Face datasets library and its .map() method with the num_proc argument, which enables parallel tokenization: Main classes
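
For example, roughly like this (a minimal sketch; the "text" column name and num_proc=8 are placeholders for your own setup):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama2-tokenizer", trust_remote_code=True)

# Wrap the list of strings in a Dataset; "text" is a placeholder column name.
ds = Dataset.from_dict({"text": data})

def tokenize_batch(batch):
    # With batched=True, map passes a dict of lists, so the fast
    # tokenizer encodes a whole batch of strings per call.
    return tokenizer(batch["text"])

# num_proc splits the dataset into shards processed in parallel workers.
ds = ds.map(tokenize_batch, batched=True, num_proc=8)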

Thanks for your answer!
I know that the Hugging Face datasets .map() function is a good way to do it, but I also wonder how to do it in a more custom way without that module. And I found that by adding the chunksize parameter, the total time of the multiprocessing version decreases a lot, to about the same as the multithreading version.