Model I am using: BigBirdTokenizerFast.from_pretrained('google/bigbird-roberta-base')
The likely problem is that when you load the pre-trained tokenizer and check its eos and bos token mapping, it turns out to be the opposite of what is expected, namely:
bos_token = </s>
eos_token = <s>
To reproduce
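A minimal sketch of the check (the model name is the one from above; the inline comment reflects the swapped mapping reported here, not guaranteed output):

from transformers import BigBirdTokenizerFast

tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
# per this report: prints </s> <s> instead of the expected <s> </s>
print(tokenizer.bos_token, tokenizer.eos_token)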
Expected behavior
According to the implementation of BigBirdTokenizerFast, the default special token mapping should be:
bos_token = <s>
eos_token = </s>
The tokenizer config needs to be updated on the HuggingFace Hub (the tokenizer code itself is absolutely fine), and I need @patrickvonplaten's approval to update it.
# workaround: download the sentencepiece model and load it locally,
# so the implementation's default special tokens are used
wget https://huggingface.co/google/bigbird-roberta-base/resolve/main/spiece.model

from transformers import BigBirdTokenizer

tokenizer = BigBirdTokenizer("spiece.model")
# similarly for fast tokenizer
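For completeness, a sketch of the fast-tokenizer variant of the same workaround (assuming BigBirdTokenizerFast is also built from the local sentencepiece file via the slow-to-fast converter):

from transformers import BigBirdTokenizerFast

fast_tokenizer = BigBirdTokenizerFast("spiece.model")
# should print the class defaults: <s> </s>
print(fast_tokenizer.bos_token, fast_tokenizer.eos_token)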
Thank you @vasudevgupta for the quick feedback and the awesome implementation.
I have a related point that I'd like you to clarify a bit (I've been reading all the docs and can't find an answer elsewhere).
It seems you trained the tokenizer using SentencePiece directly, and the BigBirdTokenizer implementation expects a vocab file to initialize it.
I'm training a Unigram tokenizer from scratch using HF Tokenizers and loading it into BigBirdTokenizerFast with tokenizer = BigBirdTokenizerFast(tokenizer_object=my_tokenizer), as recommended in Tokenizer — transformers 4.7.0 documentation; a sketch of the setup follows.
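Concretely, a minimal sketch of what I'm doing (the corpus file, vocab size, and special-token list are placeholders, not the actual setup):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import BigBirdTokenizerFast

# train a Unigram tokenizer from scratch with HF Tokenizers
my_tokenizer = Tokenizer(models.Unigram())
my_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    unk_token="<unk>",
    special_tokens=["<unk>", "<s>", "</s>", "<pad>", "[SEP]", "[CLS]", "[MASK]"],
)
my_tokenizer.train(files=["corpus.txt"], trainer=trainer)

# wrap the trained tokenizer in the fast tokenizer class
tokenizer = BigBirdTokenizerFast(tokenizer_object=my_tokenizer)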
Everything seems to work OK, except that when I call tokenizer.save_vocabulary(), the current implementation copies whatever is in self.vocab_file to an output file, and neither of those exists in my case.
I've checked that in my case the internal vocab attribute (self.vocab) is properly populated from the HF Tokenizers object. However, my question is: shouldn't BigBirdTokenizerFast take care of saving the vocabulary from the vocab attribute rather than trying to make a copy of vocab_file?
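For reference, the call that trips this up in my setup (the directory name is a placeholder; no vocab_file was ever passed, so self.vocab_file is not set):

tokenizer = BigBirdTokenizerFast(tokenizer_object=my_tokenizer)
# tries to copy self.vocab_file, which was never set here
tokenizer.save_vocabulary("output_dir")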
I checked the BertTokenizer implementation of save_vocabulary(), which does something like this:
index = 0
with open(vocab_file, "w", encoding="utf-8") as writer:
    for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
        if index != token_index:
            logger.warning(
                f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                " Please check that the vocabulary is not corrupted!"
            )
            index = token_index
        writer.write(token + "\n")
        index += 1
Could you clarify whether training from scratch with HF Tokenizers is supported by BigBirdTokenizerFast?
Or should I train it directly with SentencePiece?
Sorry for semi-hijacking the thread, but this is how I found the earlier bug.