Transformers Tokenizer on GPU?

Hi, I am finding that tokenization takes a long time when I have large text data. There may be some documentation about this somewhere, but I could not find any that addresses how to use multiple GPUs for tokenization. Any help will be much appreciated. Thanks!

@Narsil might be able to help here

Hi @jaecha.

Tokenization does not happen on GPU (and won’t anytime soon). Could you share your tokenizer config? That would help us understand why it takes so long: the tokenizers version, what kind of model, and maybe how large your data is.
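Since it all runs on CPU, the usual speedup is to batch your texts into a single call so the fast (Rust-backed) tokenizer can parallelize internally. A minimal sketch (the model name and texts are placeholders, not taken from your setup):

```python
# A rough sketch: one batched call lets the Rust-backed "fast" tokenizer
# spread the work across CPU cores, instead of a slow Python loop of
# single-sentence calls. Model name and texts are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

texts = ["first document ...", "second document ..."] * 10_000

# Tokenize everything in one call; the batch is parallelized internally.
encodings = tokenizer(texts, truncation=True)
```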
If you see a warning about TOKENIZERS_PARALLELISM in your console:

  • If you use multiple workers (e.g. with a DataLoader), it’s better to create a tokenizer instance in each worker rather than before the fork; otherwise we can’t use multiple cores (because of the GIL). See the sketch below.
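
For example, a hypothetical sketch of lazy per-worker instantiation with a PyTorch DataLoader (the dataset class, model name, and collate function are all illustrative):

```python
# Hypothetical sketch of "create the tokenizer in each worker": the
# tokenizer is built lazily on the first __getitem__ call, which runs
# inside the forked DataLoader worker, not in the parent process.
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer


class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = None  # deliberately NOT created in the parent

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        if self.tokenizer is None:
            # First call inside this worker: build the tokenizer after
            # the fork, so its thread pool belongs to this worker.
            self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        return self.tokenizer(self.texts[idx], truncation=True)


def collate(batch):
    # Keep raw encodings; real code would pad into tensors here.
    return batch


if __name__ == "__main__":
    loader = DataLoader(TextDataset(["some text"] * 64),
                        num_workers=4, batch_size=8, collate_fn=collate)
    for batch in loader:
        pass  # consume the batches
```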

At the very least, having a good pre_tokenizer is important (usually Whitespace splitting, for languages where that applies).

You can find more information about your options here: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#pre-tokenization
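
As a quick illustration of what those docs describe (the sentence is just a placeholder):

```python
# Illustrative only: a Whitespace pre-tokenizer from the tokenizers
# library, which splits the text on word boundaries before the model
# sees it, returning (piece, offsets) pairs.
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hello, how are you?"))
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)),
#  ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19))]
```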

Does that help?

Cheers
Nicolas

Thank you for your replies! I think I know what to do next. Thanks!