Hi, I'm finding that tokenization takes a long time when I have large text data. There may be documentation about this somewhere, but I couldn't find anything that addresses how to use multiple GPUs for tokenization. Any help would be much appreciated. Thanks!
@Narsil might be able to help here.
Hi @jaecha.
Tokenization does not happen on GPU (and won't anytime soon). Could you share your tokenizer config? That would help us understand why it takes so long: your tokenizers version, what kind of model you're using, and roughly how large your data is.
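For reference, something like this prints the details I mean (the checkpoint name below is just a placeholder for whatever you're actually loading):

```python
import tokenizers
import transformers
from transformers import AutoTokenizer

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)

# Placeholder checkpoint; use your own model name or path here.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# A "fast" (Rust-backed) tokenizer is much quicker than the pure-Python one.
print(type(tok).__name__, "is_fast =", tok.is_fast)
```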
If you see a warning about TOKENIZERS_PARALLELISM in your console:
- If you use multiple threads (as with a DataLoader), it's better to create a tokenizer instance in each worker rather than before the fork; otherwise we can't use multiple cores (because of the GIL). See the sketch right below.
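Here's a minimal sketch of that per-worker pattern, assuming PyTorch and transformers (the checkpoint name and data are placeholders):

```python
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer

class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = None  # created lazily inside each worker, not before the fork

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # First access inside a worker process builds that worker's own
        # tokenizer, so the fast (Rust) tokenizer keeps its internal
        # parallelism without the TOKENIZERS_PARALLELISM warning.
        if self.tokenizer is None:
            self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        return self.tokenizer(self.texts[idx], truncation=True)

loader = DataLoader(
    TextDataset(["some text", "some much longer text"] * 1000),
    num_workers=4,
    batch_size=8,
    collate_fn=list,  # keep raw encodings; pad/stack however your model needs
)
for batch in loader:
    pass  # each batch was tokenized inside a worker process
```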
Having a good pre_tokenizer is also important (usually Whitespace splitting, for languages that allow it).
You can find more information about your options here: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#pre-tokenization
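For example, attaching a Whitespace pre-tokenizer when building a tokenizer from scratch looks like this (names follow the pipeline docs linked above; the WordPiece model here is just an example):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Pre-splitting on whitespace means the model never has to scan whole
# documents character by character, which is usually a big speed win.
tokenizer.pre_tokenizer = Whitespace()

print(tokenizer.pre_tokenizer.pre_tokenize_str("Tokenize this quickly!"))
# [('Tokenize', (0, 8)), ('this', (9, 13)), ('quickly', (14, 21)), ('!', (21, 22))]
```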
Does that help?
Cheers
Nicolas
Thank you for your replies! I think I know what to do next. Thanks!