Why does the HF tokenizer take longer when called just once?

Hello,

I guess this is more of a Python for-loop issue and/or a Colab one, but since I tested it with an HF tokenizer, I’m posting the question on this forum: why does the HF tokenizer take longer when called just once?

I published a Colab notebook to illustrate this issue, which is shown in the following graph:

[Graph from the notebook: tokenization time for a single call vs. the average over repeated calls on the same text]

If I run the tokenizer just once on a text, it always takes much longer than the average time of x tokenizations of the same text. Strange, no?

Configuration

  • transformers version: 4.11.3
  • tokenizer from the model bert-base-uncased
  • import code for the tokenizer:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"  # the model listed above
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
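
For reference, here is a minimal sketch of the kind of timing comparison done in the notebook (the sample text and the use of time.perf_counter are illustrative choices, not necessarily what the notebook uses):

import time
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
text = "The quick brown fox jumps over the lazy dog."

# time a single tokenization call
start = time.perf_counter()
tokenizer(text)
single_ms = (time.perf_counter() - start) * 1000

# time the average over many calls on the same text
n = 100
start = time.perf_counter()
for _ in range(n):
    tokenizer(text)
avg_ms = (time.perf_counter() - start) * 1000 / n

print(f"single call: {single_ms:.2f} ms, average over {n} calls: {avg_ms:.2f} ms")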

I think it is just the first run of the tokenization that takes longer, and subsequent calls are faster. I noticed that this only happens with the fast tokenizer, so I think it is due to how the fast tokenizer is designed, although I don’t know the details behind it; maybe a delayed operation?

Hi @Emanuel,

Thanks for your answer, but if you’re right (it is just the first run of the tokenization that takes longer and subsequent calls are faster), this would be a huge problem in production when we need to tokenize texts one by one (i.e., when the user needs it)… and the second question would then be: why does the (fast) HF tokenizer act this way? (What is the technical reason behind this “attitude”?)

After profiling the method I saw no different instructions being executed on the first and second calls, so it is not a delayed operation in the library, as I erroneously said before. My guess is that the .pyc files have not been created yet on the first call; the second time around it takes advantage of them and runs faster.
I don’t think this would be an issue in production since the .pyc files are going to be there after the first call.
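
For completeness, the comparison I mean is along these lines (a sketch using cProfile; the exact profiling setup may differ from what I ran):

import cProfile
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
text = "The quick brown fox jumps over the lazy dog."

# profile the first and second call separately, in the same Python process,
# and compare which functions show up in each report
cProfile.run("tokenizer(text)", sort="cumtime")  # first call
cProfile.run("tokenizer(text)", sort="cumtime")  # second call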


I do not think that makes a lot of sense here. Why would a .pyc be created again and again for new loops within the same Python session? That defeats the whole point of having .pyc files in the first place.

I do not know the cause of the delay (perhaps related to how Rust handles this behind the scenes), but it is not problematic. Even though a single run is about 100x slower than the average over a loop of 100,000, the slowest (single) tokenisation run still only takes 1-2 ms. That’s peanuts, and the difference won’t be noticeable to the user.

Bram, it is just the first call that is slow; all the subsequent ones are faster. For example, if you run the cell that tokenizes the text only once, the first time you will get ~26 ms, and if you run the same cell a second time you will get 2 ms. Therefore, it doesn’t matter whether you tokenize 1 or N samples.
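
If that is the case, a single warm-up call at startup should be enough before serving user requests (a sketch; the timing with time.perf_counter is just for illustration):

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# warm-up: pay the one-off cost of the first call at startup
tokenizer("warm-up text")

# later, single calls on user input are fast
start = time.perf_counter()
tokenizer("Some user input to tokenize.")
print(f"{(time.perf_counter() - start) * 1000:.2f} ms")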

That’s not what I understand from the code. If you look at the output of the notebook, note that the same tokenizer is used in all cases and is not reinitialised. So it is not “just the first call to the tokenizer”:

  • tokenizing only once: 20.22ms
  • tokenizing ten times the same text: avg. 7.96ms
  • tokenizing 100 times: avg. 0.19ms

So yes, the first iteration in the loop is slow, but not the first call to the tokenizer, because the tokenizer is not re-initialised. So it is not likely that a .pyc is recreated every time a new loop is encountered.
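
One way to check this within a single session is to time every iteration of a few back-to-back loops and see whether a slow first iteration reappears in each loop (a sketch; the per-iteration timing setup is my own, not the notebook’s):

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
text = "The quick brown fox jumps over the lazy dog."

for loop in range(3):
    per_iter_ms = []
    for _ in range(10):
        start = time.perf_counter()
        tokenizer(text)
        per_iter_ms.append((time.perf_counter() - start) * 1000)
    # if only the very first call of the session were slow, loops 2 and 3
    # should not show a slow first iteration
    print(f"loop {loop}: " + ", ".join(f"{t:.2f}" for t in per_iter_ms))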

I thought at first that perhaps the result was being cached by the tokenizer (something like an LRU cache), but even with random string generation at each iteration I see the same behavior.
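
Something along these lines (a sketch of the random-string check; the helper function and its parameters are my own illustration, not the exact notebook cell):

import random
import string
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

def random_text(n_words=20, word_len=6):
    # build a sentence of random lowercase "words" so no two iterations share input
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    )

times_ms = []
for _ in range(100):
    text = random_text()
    start = time.perf_counter()
    tokenizer(text)
    times_ms.append((time.perf_counter() - start) * 1000)

# even without repeated inputs, the first iteration is still much slower than the rest
print(f"first: {times_ms[0]:.2f} ms, avg of the rest: {sum(times_ms[1:]) / 99:.2f} ms")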
