Get "using the `__call__` method is faster" warning with DataCollatorWithPadding

When I use the out-of-the-box DataCollatorWithPadding I get my output filled with the warning:

You’re using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

If I change to use a different, custom collator, then the warning goes away.
Would anyone know what I could be doing wrong that’s causing this warning?

Or alternatively, if I can’t fix the problem that’s causing this warning, is there a way to hide it?
I’ve tried a few different ways of turning off warnings, but so far I’ve had no luck, and because the message is written out multiple times it starts to swamp the actual output from my training.


I’m getting the same thing using a BertTokenizerFast with DataCollatorWithPadding; the warning appears once per worker every time I loop over a DataLoader. I would prefer not to silence warnings in my training code, but here’s how I’m getting around it (based on this line in the PreTrainedTokenizerBase class, referencing this section in the custom logger):

import os

# Turn off Transformers' advisory warnings, which includes this padding hint
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
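
If you don’t mind hiding other Transformers warnings as well, another option (not from this thread, just the library’s standard logging controls) is to lower the library’s log level, for example:

from transformers.utils import logging

# Show only errors; this also suppresses advisory warnings like the padding hint
logging.set_verbosity_error()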

Same situation here: it happened when using a DistilBertTokenizerFast tokenizer with DataCollatorWithPadding. I don’t know what this warning actually means. And if I wanted to follow the suggestion hinted at in the warning message, what should I do?


I am getting the following suggestion while using the NllbTokenizerFast tokenizer:
You’re using a NllbTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

Can anyone explain the given suggestion?

I guess the problem most people in this thread have is not with the warning itself but with not knowing (at least on my part) whether there is a better way of doing what we are doing. I also use a preprocess function.


It would be nice to get an answer from someone on this.

So my guess is that doing tokenizer(input) with padding is faster than wrapping the data collator around it?


Any news on this? Personally I found this warning more annoying than informative, so I ended up commenting it out in the source code, since I couldn’t find any information.


Hello,

The warning is confusing because it does not tell you how to use tokenizer(...) instead.
I think it just means that one should use the padding feature of the tokenizer when tokenizing your dataset. So, instead of using a collator to pad your inputs, skip that step and use tokenizer(..., padding=True) to add the padding while tokenizing. Apparently, this is faster for fast tokenizers (i.e. the tokenizers written in Rust).
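
For illustration, here is a minimal sketch of the two patterns the warning seems to be contrasting (the checkpoint and sentences are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # loads a fast tokenizer
texts = ["a short sentence", "a somewhat longer example sentence"]

# Pattern the warning complains about: encode each text on its own, then pad the
# encodings in a second step (roughly what DataCollatorWithPadding does via tokenizer.pad).
features = [tokenizer(t) for t in texts]
batch_padded_later = tokenizer.pad(features, padding=True, return_tensors="pt")

# Pattern the warning suggests: tokenize and pad in a single __call__.
batch_padded_directly = tokenizer(texts, padding=True, return_tensors="pt")

One trade-off to keep in mind: padding=True pads to the longest sequence in whatever you pass in, so padding inside a preprocess/map function pads per preprocessing batch, whereas a collator pads per training batch (dynamic padding).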

Remark: In case you didn’t know, calling tokenizer(...) internally calls Tokenizer.__call__, which is a magic Python method: python-callable-instances

Disclaimer: That’s just my interpretation of what it could mean. I would be happy if someone could confirm this.