Hello everyone, I want to convert all my texts to token IDs and save them to files, so I'm using the tokenizer's encode function. I tried it both a multi-process way and a multi-thread way:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama2-tokenizer", trust_remote_code=True)

# Multi-thread way
with ThreadPoolExecutor(max_workers=8) as pool:
    result = list(pool.map(tokenizer, data))  # data is a list of str

# Multi-process way
with ProcessPoolExecutor(max_workers=8) as pool:
    result = list(pool.map(tokenizer, data))
The results show the multi-process way is much slower than the multi-thread way: for about 1M tokens in total, the multi-process way takes about 34 s while the multi-thread way takes about 2 s.
Isn't the encode function computationally intensive? The low-level implementation of encode is in the Tokenizers package, written in Rust. I can't figure out what causes the result above. Can anybody offer an explanation? Thanks!
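For what it's worth, here is a minimal, self-contained sketch of the same comparison that doesn't need transformers at all. `fake_encode` is a made-up stand-in for `tokenizer.encode`: it does only trivial per-call work, so any difference between the two executors comes from the executors themselves (e.g. per-item pickling and inter-process communication in the process pool) rather than from tokenization:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def fake_encode(text):
    # Stand-in for tokenizer.encode: cheap per-call work, so that
    # executor overhead dominates the measurement.
    return [ord(c) for c in text]


def timed_map(executor_cls, func, items, workers=8):
    # Map `func` over `items` with the given executor and time it.
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        result = list(pool.map(func, items))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    data = ["hello world"] * 20_000  # many small work items

    thread_result, thread_s = timed_map(ThreadPoolExecutor, fake_encode, data)
    process_result, process_s = timed_map(ProcessPoolExecutor, fake_encode, data)

    assert thread_result == process_result
    print(f"threads:   {thread_s:.3f}s")
    print(f"processes: {process_s:.3f}s")
```

One variable worth controlling in such a comparison: `ProcessPoolExecutor.map` defaults to `chunksize=1`, i.e. each string is pickled and sent to a worker individually; passing a larger `chunksize` (say 256) amortizes that per-item cost and should narrow the gap if IPC overhead is the culprit.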