Get "using the `__call__` method is faster" warning with DataCollatorWithPadding

When I use the out-of-the-box DataCollatorWithPadding I get my output filled with the warning:

You’re using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

If I change to use a different, custom collator, then the warning goes away.
Would anyone know what I could be doing wrong that’s causing this warning?

Or alternatively, if I can’t fix the problem that’s causing this warning, is there a way to hide it?
I’ve tried a few different ways of turning off warnings, but so far I’ve had no luck, and because the message is written out multiple times it starts to swamp the actual output from my training.


I’m getting the same thing using a BertTokenizerFast with DataCollatorWithPadding; the warning appears once per worker every time I loop over a DataLoader. I would prefer not to silence warnings in my training code, but here’s how I’m getting around it (based on this line in the PreTrainedTokenizerBase class, referencing this section in the custom logger):

import os

# Turn off Transformers' advisory warnings, which includes this padding hint
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
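
If you don’t mind hiding other Transformers warnings as well, another option (not from this thread, just the library’s standard logging controls) is to lower the library’s log level, for example:

from transformers.utils import logging

# Show only errors; this also suppresses advisory warnings like the padding hint
logging.set_verbosity_error()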

Same situation here: it happened when using a DistilBertTokenizerFast tokenizer with DataCollatorWithPadding. I don’t know what this warning actually means. And if I wanted to follow the suggestion hinted at in the warning message, what should I do?


I am getting the following suggestion while using the NllbTokenizerFast tokenizer:
You’re using a NllbTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

Can anyone explain the given suggestion?

I guess the problem most people in this thread have is not with the warning itself but with not knowing (at least on my part) whether there is a better way of doing what we are doing. I also use a preprocess function.


It would be nice to get an answer from someone on this.

So my guess is that doing tokenizer(input) with padding is faster than wrapping the data collator around it?


Any news on this? Personally I found this warning more annoying than informative, so I ended up commenting it out in the source code, since I couldn’t find any information.


Hello,

The warning is confusing because it does not tell you how to use tokenizer(...) instead.
I think it just means that one should use the padding feature of the tokenizer when tokenizing your dataset. So, instead of using a collator to pad your inputs, skip that step and use tokenizer(..., padding=True) to add the padding while tokenizing. Apparently, this is faster for fast tokenizers (i.e. the tokenizers written in Rust).
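
For illustration, here is a minimal sketch of the two patterns the warning seems to be contrasting (the checkpoint and sentences are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # loads a fast tokenizer
texts = ["a short sentence", "a somewhat longer example sentence"]

# Pattern the warning complains about: encode each text on its own, then pad the
# encodings in a second step (roughly what DataCollatorWithPadding does via tokenizer.pad).
features = [tokenizer(t) for t in texts]
batch_padded_later = tokenizer.pad(features, padding=True, return_tensors="pt")

# Pattern the warning suggests: tokenize and pad in a single __call__.
batch_padded_directly = tokenizer(texts, padding=True, return_tensors="pt")

One trade-off to keep in mind: padding=True pads to the longest sequence in whatever you pass in, so padding inside a preprocess/map function pads per preprocessing batch, whereas a collator pads per training batch (dynamic padding).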

Remark: In case you didn’t know, calling tokenizer(...) internally calls Tokenizer.__call__, which is a magic Python method: python-callable-instances

Disclaimer: That’s just my interpretation of what it could mean. I would be happy if someone could confirm this.