Bug in the Flaubert tokenizer_config.json do_lowercase option

This is a bug report for the Flaubert tokenizer.
The tokenizer_config.json of every model in the Flaubert model repo (for example, here) has the wrong option name:

  "do_lower_case": true

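while it should read do_lowercase, as expected by the FlaubertTokenizer. As a result, the misspelled option is ignored and every Flaubert tokenizer, including the uncased ones, falls back to the default do_lowercase=False and is case-sensitive.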
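Until the hosted files are corrected, a local copy of the config can be patched by renaming the key. The following is only a minimal sketch, assuming tokenizer_config.json has already been downloaded to a local directory. The helper names `fix_lowercase_key` and `patch_config_file` are hypothetical and not part of transformers.

```python
import json
from pathlib import Path


def fix_lowercase_key(config: dict) -> dict:
    """Return a copy of the config with `do_lower_case` renamed to `do_lowercase`."""
    fixed = dict(config)
    if "do_lower_case" in fixed:
        value = fixed.pop("do_lower_case")
        # Keep an existing `do_lowercase` value if the file already has one.
        fixed.setdefault("do_lowercase", value)
    return fixed


def patch_config_file(path: str) -> None:
    """Rewrite a local tokenizer_config.json in place (the path is an assumption)."""
    p = Path(path)
    config = json.loads(p.read_text(encoding="utf-8"))
    p.write_text(json.dumps(fix_lowercase_key(config), indent=2), encoding="utf-8")
```

Passing do_lowercase=True as a keyword argument to FlaubertTokenizer.from_pretrained should have the same effect without touching the file, since extra keyword arguments are forwarded to the tokenizer constructor, but I have not verified this on every transformers version.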

In my project (flaubert-base-uncased model) the bug first manifested itself in transformers v4.4.0. Previous versions of transformers didn’t download this file, so I only noticed the problem during the version upgrade. It may be related to Pull Request #10624, but I’m not at all sure here and it probably doesn’t really matter.

Thanks for correcting the bug and many thanks for the great library.

Environment info

  • transformers version: 4.4.0
  • Platform: Linux-5.4.0-80-generic-x86_64-with-glibc2.10
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.9.0a0+df837d0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@thomwolf @patrickvonplaten

Hi, there hasn’t been any answer to my report. Maybe I posted it in the wrong place or didn’t tag the correct person? Could someone please help me get the message through?

Also, in case this helps, here are some steps that anyone can take to reproduce the incorrect model behaviour:

  1. Download the default ‘flaubert-base-uncased’ model from the official repo, fix the seed, start the training and print out something (train loss, some weights, etc.).
  2. Then, in the same downloaded model, change the option to "do_lower_case": false and restart the training. Make sure that the printed value is exactly the same.
  3. Then, still in the same downloaded model, change the option to "do_lowercase": false, restart the training and once again make sure that the printed value is the same.

Basically, those three training runs are the same because in the first two cases the tokenizer ignores the misspelled option and uses the default value "do_lowercase": false, while in the third one we select that value explicitly.

  4. Finally, in the very same model, set "do_lowercase": true, launch the training and check that the printed value is now different. This confirms that do_lowercase is indeed the correct option name, the one that actually controls lowercasing in the tokenizer.
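The same effect can probably be seen without any training, by comparing the tokens directly. The following is a rough sketch rather than a verified script: the model id "flaubert/flaubert_base_uncased", the sample sentence, and the helper names are my assumptions, and `check_flaubert_casing` needs transformers (plus sacremoses, which I believe the Flaubert tokenizer relies on) and network access.

```python
def tokenizations_differ(tok_a, tok_b, text: str) -> bool:
    """Return True if the two tokenizers split `text` into different tokens."""
    return tok_a.tokenize(text) != tok_b.tokenize(text)


def check_flaubert_casing(model_name: str = "flaubert/flaubert_base_uncased") -> bool:
    """Compare the default tokenizer with one where do_lowercase is forced on.

    The model id is an assumption; adjust it to the checkpoint you use.
    """
    # Imported lazily so the pure helper above works without transformers.
    from transformers import FlaubertTokenizer

    default_tok = FlaubertTokenizer.from_pretrained(model_name)
    lowercase_tok = FlaubertTokenizer.from_pretrained(model_name, do_lowercase=True)
    # A sentence with capital letters. If the hosted config were honoured for
    # an uncased model, both tokenizers would be expected to agree.
    return tokenizations_differ(default_tok, lowercase_tok, "Bonjour Paris")
```

If `check_flaubert_casing()` returns True for an uncased checkpoint, the downloaded config is not lowercasing the input, which matches the behaviour described in the steps above.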