Bug in the Flaubert tokenizer_config.json do_lowercase option

vmariassin · September 17, 2021, 2:23pm

Hi,
This is a bug report for the Flaubert tokenizer.
The tokenizer_config.json of every model in the Flaubert model repo (for example, the one linked here) uses the wrong option name:

{
  "do_lower_case": true
}

while it should read do_lowercase, as expected by the FlaubertTokenizer. This results in all Flaubert models having case-sensitive tokenizers.
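To illustrate the mechanism, here is a minimal sketch of how a misspelled option can be silently ignored. ToyTokenizer is a hypothetical stand-in, not the real FlaubertTokenizer; the only thing it assumes is a constructor that takes do_lowercase (default False) plus arbitrary extra keyword arguments, which is consistent with the behaviour described above.

```python
class ToyTokenizer:
    """Hypothetical stand-in used for illustration; NOT the real FlaubertTokenizer."""

    def __init__(self, do_lowercase=False, **kwargs):
        # Only `do_lowercase` is recognised; anything else lands in kwargs
        # and has no effect on preprocessing.
        self.do_lowercase = do_lowercase
        self.ignored_options = kwargs

    def preprocess(self, text):
        # Lowercase the text only when the recognised option is set.
        return text.lower() if self.do_lowercase else text


# Option name as it appears in the published tokenizer_config.json.
config_from_hub = {"do_lower_case": True}
# Option name the tokenizer is expected to read.
expected_config = {"do_lowercase": True}

tok_wrong = ToyTokenizer(**config_from_hub)
tok_right = ToyTokenizer(**expected_config)

print(tok_wrong.ignored_options)                # the misspelled key ends up here
print(tok_wrong.preprocess("Bonjour Paris"))    # case is preserved
print(tok_right.preprocess("Bonjour Paris"))    # text is lowercased
```

With the misspelled key the toy tokenizer keeps its default and stays case-sensitive, even though the config asks for lowercasing.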

In my project (flaubert-base-uncased model) the bug first appeared with transformers v4.4.0. Earlier versions of transformers didn’t download this file, so I only noticed it when upgrading. It may be related to Pull Request #10624, but I’m not sure, and it probably doesn’t matter.

Thanks for correcting the bug and many thanks for the great library.

Environment info

transformers version: 4.4.0
Platform: Linux-5.4.0-80-generic-x86_64-with-glibc2.10
Python version: 3.8.8
PyTorch version (GPU?): 1.9.0a0+df837d0 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help

@thomwolf @patrickvonplaten

vmariassin · September 28, 2021, 6:33pm

Hi, there hasn’t been any answer to my report. Maybe I posted it in the wrong place or didn’t tag the right person? Could someone please help me get the message through?

Also, in case it helps, here are some steps to reproduce the incorrect model behaviour:

Download the default ‘flaubert-base-uncased’ model from the official repo, fix the seed, start the training and print out something (train loss, some weights, etc.).
Then, in the same downloaded model, change the option to "do_lower_case": false and restart the training. Make sure that the printed value is exactly the same.
Then, still in the same downloaded model, change the option to "do_lowercase": false, restart the training and once again make sure that the printed value is the same.

Basically, those three training runs are identical because in the first two cases the tokenizer falls back to its default value of "do_lowercase": false, while in the third one we select that value explicitly.

Finally, in the very same model, set "do_lowercase": true, launch the training and check that the run is now different. This confirms that do_lowercase is the correct option name, the one that actually controls whether the input is lowercased.
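As a local workaround until the hosted files are corrected, the key in a downloaded copy can be renamed with a short script. This is only a sketch: fix_tokenizer_config is a hypothetical helper name, and model_dir is assumed to be a local directory that contains the downloaded tokenizer_config.json.

```python
import json
from pathlib import Path

WRONG_KEY = "do_lower_case"   # name found in the published config
RIGHT_KEY = "do_lowercase"    # name the Flaubert tokenizer reads


def fix_tokenizer_config(model_dir):
    """Rename the misspelled key in <model_dir>/tokenizer_config.json.

    Hypothetical helper. Returns True if the file was rewritten,
    False if no change was needed.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Only touch the file when the wrong key is present and the right one is not.
    if WRONG_KEY in config and RIGHT_KEY not in config:
        config[RIGHT_KEY] = config.pop(WRONG_KEY)
        config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
        return True
    return False
```

Calling fix_tokenizer_config on the downloaded model directory before loading the tokenizer should give the same result as the manual edit in the last step above.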
