AutoTokenizer.encode with multithreading and multiprocessing

Hello everyone, I want to convert all my texts to token IDs and save them to files.
So I use the encode function to do it.
I tried it both with multiprocessing and with multithreading:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama2-tokenizer", trust_remote_code=True)

# Multithreading
with ThreadPoolExecutor(max_workers=8) as pool:
    result = list(pool.map(tokenizer, data))  # data is a list of str

# Multiprocessing
with ProcessPoolExecutor(max_workers=8) as pool:
    result = list(pool.map(tokenizer, data))

The results show that multiprocessing is much slower than multithreading: for about 1M tokens in total, the multiprocessing version takes about 34 s while the multithreading version takes about 2 s.

Isn't encoding computationally intensive? Since the low-level implementation of encode lives in the Rust-based Tokenizers package, I can't figure out what causes the result above. Can anybody offer an explanation? Thanks!


Hey

The best option would be to use the Hugging Face datasets library and its .map() method with the num_proc argument, which enables parallel tokenization: Main classes
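
For example, roughly like this (a minimal sketch; the "text" column name and num_proc=8 are placeholders for your own setup):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama2-tokenizer", trust_remote_code=True)

# Wrap the list of strings in a Dataset; "text" is a placeholder column name.
ds = Dataset.from_dict({"text": data})

def tokenize_batch(batch):
    # With batched=True, map passes a dict of lists, so the fast
    # tokenizer encodes a whole batch of strings per call.
    return tokenizer(batch["text"])

# num_proc splits the dataset into shards processed in parallel workers.
ds = ds.map(tokenize_batch, batched=True, num_proc=8)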

Thanks for your answer!
I know that the Hugging Face datasets .map() function is a good way to do it, but I also wonder how to do it in a more custom way without that module. And I found that by adding the chunksize parameter, the total time of the multiprocessing version decreases a lot, to about the same as the multithreading version.