Possible wrong BigBirdTokenizerFast special token initialization in pretrained model

Environment info

  • transformers version: 4.9.0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): 2.5.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@vasudevgupta

Information

Model I am using: BigBirdTokenizerFast.from_pretrained('google/bigbird-roberta-base')

The likely problem: when you load the pre-trained tokenizer and check its bos and eos token mapping, it turns out to be the opposite of what is expected, namely:
bos_token = </s>
eos_token = <s>

To reproduce
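
A minimal check, based on the description above:

from transformers import BigBirdTokenizerFast

# load the pretrained tokenizer and inspect the special tokens it picked up
# from the hub config
tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
print(tokenizer.bos_token, tokenizer.eos_token)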

Expected behavior

According to the implementation of BigBirdTokenizerFast, the default special token mapping should be:
bos_token = <s>
eos_token = </s>

Hello @nabito,

This issue is similar to Text Classification on GLUE on TPU using Jax/Flax : BigBird · Issue #12483 · huggingface/transformers · GitHub.

The tokenizer config on the HuggingFace Hub needs to be updated (the tokenizer code itself is absolutely fine), and I need @patrickvonplaten's approval for updating it.

@patrickvonplaten, the fix is quite simple; we just need to run this:

# download the SentencePiece model file used by the original tokenizer
wget https://huggingface.co/google/bigbird-roberta-base/resolve/main/spiece.model

from transformers import BigBirdTokenizer

# re-create the tokenizer from the SentencePiece model, so it gets the
# default (correct) special tokens, then push the config back to the hub
tokenizer = BigBirdTokenizer("spiece.model")
tokenizer.push_to_hub("google/bigbird-roberta-base")

@nabito, for now you can do this:

wget https://huggingface.co/google/bigbird-roberta-base/resolve/main/spiece.model

from transformers import BigBirdTokenizer
tokenizer = BigBirdTokenizer("spiece.model")

# similarly for fast tokenizer
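
For example, with the same local spiece.model file (the fast class converts it on the fly, assuming sentencepiece is installed):

from transformers import BigBirdTokenizerFast

# builds the fast tokenizer from the local SentencePiece model,
# so the default (correct) special tokens are used
fast_tokenizer = BigBirdTokenizerFast("spiece.model")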

Thank you @vasudevgupta for the quick feedback and the awesome implementation.

I have a related point I'd like you to clarify a bit (I've been reading all the docs and can't find this anywhere else).

It seems the tokenizer was trained with SentencePiece directly, and the BigBirdTokenizer implementation expects a vocab file to initialize it.

I'm training a Unigram tokenizer from scratch using HF Tokenizers and loading it into BigBirdTokenizerFast via
tokenizer = BigBirdTokenizerFast(tokenizer_object=my_tokenizer)
as recommended in the Tokenizer — transformers 4.7.0 documentation.
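
Concretely, here is roughly what I'm doing (the corpus path, vocab size, and special-token list are just placeholders for my actual setup):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import BigBirdTokenizerFast

# train a Unigram model from scratch with HF Tokenizers
my_tokenizer = Tokenizer(models.Unigram())
my_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "[SEP]", "[CLS]", "[MASK]"],
    unk_token="<unk>",
)
my_tokenizer.train(files=["my_corpus.txt"], trainer=trainer)

# wrap the trained tokenizer in the fast tokenizer class
tokenizer = BigBirdTokenizerFast(tokenizer_object=my_tokenizer)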

Everything seems to work ok, except that when I call tokenizer.save_vocabulary(), the current implementation just copies whatever is in self.vocab_file to an output file, and self.vocab_file doesn't exist in my case.

I've checked that in my case the internal vocab attribute (self.vocab) is properly populated from the HF Tokenizers object. My question is: shouldn't BigBirdTokenizerFast take care of saving the vocabulary from the vocab attribute rather than trying to make a copy of vocab_file?

I checked the BertTokenizer implementation of save_vocabulary(), which does something like this:

        index = 0
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!"
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1

https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer
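
In the meantime, the workaround I'm using (just my own approach, not sure it's the intended one) is to serialize the backend tokenizer directly instead of going through save_vocabulary():

# saves the full fast-tokenizer state (model, vocab, special tokens) to
# tokenizer.json, bypassing the vocab_file copy in save_vocabulary()
tokenizer.backend_tokenizer.save("tokenizer.json")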

Could you clarify whether training from scratch with HF Tokenizers is supported by BigBirdTokenizerFast, or should I train it directly with SentencePiece?

Sorry for semi-hijacking the thread, but this is how I found the earlier bug.

Hello @nabito,

This tutorial might interest you: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb

I didn't train the tokenizer for BigBird while adding it to Transformers; rather, I used this vocabulary file directly: bigbird/bigbird/vocab at master · google-research/bigbird · GitHub

Hopefully the Colab notebook I shared above will help you train BigBirdTokenizerFast on your dataset.
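
In case it helps, the notebook's approach looks roughly like this (the corpus and vocab size below are placeholders, and this assumes a recent transformers version that has train_new_from_iterator):

from transformers import BigBirdTokenizerFast

base = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
# any iterator over raw text works here; a tiny in-memory list just for illustration
corpus = ["replace this with an iterator over your training text"]
# learns a new vocabulary while keeping BigBird's special tokens and post-processing
new_tokenizer = base.train_new_from_iterator(corpus, vocab_size=32000)
print(new_tokenizer("hello BigBird").input_ids)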