Possible wrong BigBirdTokenizerFast special token initialization in pretrained model

Environment info

  • transformers version: 4.9.0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): 2.5.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@vasudevgupta

Information

Model I am using: BigBirdTokenizerFast.from_pretrained('google/bigbird-roberta-base')

The likely problem: when you load the pre-trained tokenizer and check its bos and eos token mapping, it turns out to be the opposite of what is expected, namely:
bos_token = </s>
eos_token = <s>

To reproduce
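
A minimal check, based on the description above:

from transformers import BigBirdTokenizerFast

# load the pretrained tokenizer and inspect the special tokens it picked up
# from the hub config
tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
print(tokenizer.bos_token, tokenizer.eos_token)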

Expected behavior

According to the implementation of BigBirdTokenizerFast, the default special token mapping should be:
bos_token = <s>
eos_token = </s>

Hello @nabito,

This issue is similar to Text Classification on GLUE on TPU using Jax/Flax : BigBird · Issue #12483 · huggingface/transformers · GitHub.

The tokenizer config on the HuggingFace Hub needs to be updated (the tokenizer code itself is absolutely fine), and I need @patrickvonplaten's approval for updating it.

@patrickvonplaten, the fix is quite simple; we just need to run this:

# download the SentencePiece model file used by the original tokenizer
wget https://huggingface.co/google/bigbird-roberta-base/resolve/main/spiece.model

from transformers import BigBirdTokenizer

# re-create the tokenizer from the SentencePiece model, so it gets the
# default (correct) special tokens, then push the config back to the hub
tokenizer = BigBirdTokenizer("spiece.model")
tokenizer.push_to_hub("google/bigbird-roberta-base")

@nabito, for now you can do this:

wget https://huggingface.co/google/bigbird-roberta-base/resolve/main/spiece.model

from transformers import BigBirdTokenizer
tokenizer = BigBirdTokenizer("spiece.model")

# similarly for fast tokenizer
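
For example, with the same local spiece.model file (the fast class converts it on the fly, assuming sentencepiece is installed):

from transformers import BigBirdTokenizerFast

# builds the fast tokenizer from the local SentencePiece model,
# so the default (correct) special tokens are used
fast_tokenizer = BigBirdTokenizerFast("spiece.model")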

Thank you @vasudevgupta for the quick feedback and the awesome implementation.

I have a related point I'd like you to clarify a bit (I've been reading all the docs and can't find this anywhere else).

It seems the tokenizer was trained with SentencePiece directly, and the BigBirdTokenizer implementation expects a vocab file to initialize it.

I'm training a Unigram tokenizer from scratch using HF Tokenizers and loading it into BigBirdTokenizerFast via
tokenizer = BigBirdTokenizerFast(tokenizer_object=my_tokenizer)
as recommended in the Tokenizer — transformers 4.7.0 documentation.
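
Concretely, here is roughly what I'm doing (the corpus path, vocab size, and special-token list are just placeholders for my actual setup):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import BigBirdTokenizerFast

# train a Unigram model from scratch with HF Tokenizers
my_tokenizer = Tokenizer(models.Unigram())
my_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "[SEP]", "[CLS]", "[MASK]"],
    unk_token="<unk>",
)
my_tokenizer.train(files=["my_corpus.txt"], trainer=trainer)

# wrap the trained tokenizer in the fast tokenizer class
tokenizer = BigBirdTokenizerFast(tokenizer_object=my_tokenizer)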

Everything seems to work ok, except that when I call tokenizer.save_vocabulary(), the current implementation just copies whatever is in self.vocab_file to an output file, and self.vocab_file doesn't exist in my case.

I've checked that in my case the internal vocab attribute (self.vocab) is properly populated from the HF Tokenizers object. My question is: shouldn't BigBirdTokenizerFast take care of saving the vocabulary from the vocab attribute rather than trying to make a copy of vocab_file?

I checked the BertTokenizer implementation of save_vocabulary(), which does something like this:

        index = 0
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!"
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1

https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer
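
In the meantime, the workaround I'm using (just my own approach, not sure it's the intended one) is to serialize the backend tokenizer directly instead of going through save_vocabulary():

# saves the full fast-tokenizer state (model, vocab, special tokens) to
# tokenizer.json, bypassing the vocab_file copy in save_vocabulary()
tokenizer.backend_tokenizer.save("tokenizer.json")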

Could you clarify whether training from scratch with HF Tokenizers is supported by BigBirdTokenizerFast, or should I train it directly with SentencePiece?

Sorry for semi-hijacking the thread, but this is how I found the earlier bug.

Hello @nabito,

This tutorial might interest you: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb

I didn't train the tokenizer for BigBird while adding it to Transformers; rather, I used this vocabulary file directly: bigbird/bigbird/vocab at master · google-research/bigbird · GitHub

Hopefully the Colab notebook I shared above will help you train BigBirdTokenizerFast on your dataset.
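
In case it helps, the notebook's approach looks roughly like this (the corpus and vocab size below are placeholders, and this assumes a recent transformers version that has train_new_from_iterator):

from transformers import BigBirdTokenizerFast

base = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
# any iterator over raw text works here; a tiny in-memory list just for illustration
corpus = ["replace this with an iterator over your training text"]
# learns a new vocabulary while keeping BigBird's special tokens and post-processing
new_tokenizer = base.train_new_from_iterator(corpus, vocab_size=32000)
print(new_tokenizer("hello BigBird").input_ids)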