Hi, I'm finding that tokenization takes a long time when I have large text data. There may be documentation about this somewhere, but I couldn't find anything that addresses how to use multiple GPUs for tokenization. Any help would be much appreciated. Thanks!
@Narsil might be able to help here.
Hi @jaecha.
Tokenization does not happen on GPU (and won't anytime soon). Could you share your tokenizer config? That would help us understand why it takes so long: your tokenizers version, what kind of model you're using, and roughly how large your data is.
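For reference, something like this prints the details I mean (the checkpoint name below is just a placeholder for whatever you're actually loading):

```python
import tokenizers
import transformers
from transformers import AutoTokenizer

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)

# Placeholder checkpoint; use your own model name or path here.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# A "fast" (Rust-backed) tokenizer is much quicker than the pure-Python one.
print(type(tok).__name__, "is_fast =", tok.is_fast)
```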
If you see a warning about TOKENIZERS_PARALLELISM in your console:
- If you use multiple threads (as with a DataLoader), it's better to create a tokenizer instance in each worker rather than before the fork; otherwise we can't use multiple cores (because of the GIL). See the sketch right below.
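Here's a minimal sketch of that per-worker pattern, assuming PyTorch and transformers (the checkpoint name and data are placeholders):

```python
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer

class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = None  # created lazily inside each worker, not before the fork

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # First access inside a worker process builds that worker's own
        # tokenizer, so the fast (Rust) tokenizer keeps its internal
        # parallelism without the TOKENIZERS_PARALLELISM warning.
        if self.tokenizer is None:
            self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        return self.tokenizer(self.texts[idx], truncation=True)

loader = DataLoader(
    TextDataset(["some text", "some much longer text"] * 1000),
    num_workers=4,
    batch_size=8,
    collate_fn=list,  # keep raw encodings; pad/stack however your model needs
)
for batch in loader:
    pass  # each batch was tokenized inside a worker process
```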
Having a good pre_tokenizer is also important (usually Whitespace splitting, for languages that allow it).
You can find more information about your options here: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#pre-tokenization
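For example, attaching a Whitespace pre-tokenizer when building a tokenizer from scratch looks like this (names follow the pipeline docs linked above; the WordPiece model here is just an example):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Pre-splitting on whitespace means the model never has to scan whole
# documents character by character, which is usually a big speed win.
tokenizer.pre_tokenizer = Whitespace()

print(tokenizer.pre_tokenizer.pre_tokenize_str("Tokenize this quickly!"))
# [('Tokenize', (0, 8)), ('this', (9, 13)), ('quickly', (14, 21)), ('!', (21, 22))]
```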
Does that help?
Cheers
Nicolas
Thank you for your replies! I think I know what to do next. Thanks!