Why does the HF tokenizer take longer when called just once?

Hello,

I guess this is more of a Python for-loop issue and/or a Colab one, but since I tested it with an HF tokenizer, I’m posting the question on this forum: why does the HF tokenizer take longer when called just once?

I published a Colab notebook to illustrate this issue, which is shown in the following graph:

[Graph from the notebook: tokenization time for a single call vs. the average over repeated calls on the same text]

If I run the tokenizer just once on a text, it always takes much longer than the average time of x tokenizations of the same text. Strange, no?

Configuration

  • transformers version: 4.11.3
  • tokenizer from the model bert-base-uncased
  • import code for the tokenizer:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"  # the model listed above
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
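
For reference, here is a minimal sketch of the kind of timing comparison done in the notebook (the sample text and the use of time.perf_counter are illustrative choices, not necessarily what the notebook uses):

import time
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
text = "The quick brown fox jumps over the lazy dog."

# time a single tokenization call
start = time.perf_counter()
tokenizer(text)
single_ms = (time.perf_counter() - start) * 1000

# time the average over many calls on the same text
n = 100
start = time.perf_counter()
for _ in range(n):
    tokenizer(text)
avg_ms = (time.perf_counter() - start) * 1000 / n

print(f"single call: {single_ms:.2f} ms, average over {n} calls: {avg_ms:.2f} ms")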

I think it is just the first run of the tokenization that takes longer, and subsequent calls are faster. I noticed that this only happens with the fast tokenizer, so I think it is due to how the fast tokenizer is designed, although I don’t know the details behind it; maybe a delayed operation?

Hi @Emanuel,

Thanks for your answer, but if you’re right (it is just the first run of the tokenization that takes longer and subsequent calls are faster), this would be a huge problem in production when we need to tokenize texts one by one (i.e., when the user needs it)… and the second question would then be: why does the (fast) HF tokenizer act this way? (What is the technical reason behind this “attitude”?)

After profiling the method I saw no different instructions being executed on the first and second calls, so it is not a delayed operation in the library, as I erroneously said before. My guess is that the .pyc files have not been created yet on the first call; the second time around it takes advantage of them and runs faster.
I don’t think this would be an issue in production since the .pyc files are going to be there after the first call.
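
For completeness, the comparison I mean is along these lines (a sketch using cProfile; the exact profiling setup may differ from what I ran):

import cProfile
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
text = "The quick brown fox jumps over the lazy dog."

# profile the first and second call separately, in the same Python process,
# and compare which functions show up in each report
cProfile.run("tokenizer(text)", sort="cumtime")  # first call
cProfile.run("tokenizer(text)", sort="cumtime")  # second call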


I do not think that makes a lot of sense here. Why would a .pyc be created again and again for new loops within the same Python session? That defeats the whole point of having .pyc files in the first place.

I do not know the cause of the delay (perhaps related to how Rust handles this behind the scenes), but it is not problematic. Even though a single run is about 100x slower than the average over a loop of 100,000, the slowest (single) tokenisation run still only takes 1-2 ms. That’s peanuts, and the difference won’t be noticeable to the user.

Bram, it is just the first call that is slow; all the subsequent ones are faster. For example, if you run the cell that tokenizes the text only once, the first time you will get ~26 ms, and if you run the same cell a second time you will get 2 ms. Therefore, it doesn’t matter whether you tokenize 1 or N samples.
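
If that is the case, a single warm-up call at startup should be enough before serving user requests (a sketch; the timing with time.perf_counter is just for illustration):

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# warm-up: pay the one-off cost of the first call at startup
tokenizer("warm-up text")

# later, single calls on user input are fast
start = time.perf_counter()
tokenizer("Some user input to tokenize.")
print(f"{(time.perf_counter() - start) * 1000:.2f} ms")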

That’s not what I understand from the code. If you look at the output of the notebook, note that the same tokenizer is used in all cases and is not reinitialised. So it is not “just the first call to the tokenizer”:

  • tokenizing only once: 20.22ms
  • tokenizing ten times the same text: avg. 7.96ms
  • tokenizing 100 times: avg. 0.19ms

So yes, the first iteration in the loop is slow, but not the first call to the tokenizer, because the tokenizer is not re-initialised. So it is not likely that a .pyc is recreated every time a new loop is encountered.
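
One way to check this within a single session is to time every iteration of a few back-to-back loops and see whether a slow first iteration reappears in each loop (a sketch; the per-iteration timing setup is my own, not the notebook’s):

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
text = "The quick brown fox jumps over the lazy dog."

for loop in range(3):
    per_iter_ms = []
    for _ in range(10):
        start = time.perf_counter()
        tokenizer(text)
        per_iter_ms.append((time.perf_counter() - start) * 1000)
    # if only the very first call of the session were slow, loops 2 and 3
    # should not show a slow first iteration
    print(f"loop {loop}: " + ", ".join(f"{t:.2f}" for t in per_iter_ms))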

I thought at first that perhaps the result was being cached by the tokenizer (something like an LRU cache), but even with random string generation at each iteration I see the same behavior.
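
Something along these lines (a sketch of the random-string check; the helper function and its parameters are my own illustration, not the exact notebook cell):

import random
import string
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

def random_text(n_words=20, word_len=6):
    # build a sentence of random lowercase "words" so no two iterations share input
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    )

times_ms = []
for _ in range(100):
    text = random_text()
    start = time.perf_counter()
    tokenizer(text)
    times_ms.append((time.perf_counter() - start) * 1000)

# even without repeated inputs, the first iteration is still much slower than the rest
print(f"first: {times_ms[0]:.2f} ms, avg of the rest: {sum(times_ms[1:]) / 99:.2f} ms")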
