I would like to train a tokenizer from scratch and use it with BERT. I want a subword tokenizer (Unigram, BPE, or WordPiece) that generates the right files (vocab.txt). As far as I can tell, the tokenizers provided by the tokenizers library are not compatible with transformers.PreTrainedTokenizer (you cannot load the files created by one into the other). The tokenizers provided by the transformers library are all supposed to be pretrained. For example, one issue is the way subwords are handled ("##" in the BERT tokenizer, which is handled differently in tokenizers).
What is the right way to create and train a new tokenizer that I can use directly with BERT? And if this is definitely not possible, how can I convert one tokenizer to the other (or the files created by one into files that would be understood by the other)?
Thank you very much!
Have a look at the tokenizers repo.
I did; it is the essence of my question. A tokenizer from the tokenizers repo will generate merges.txt, while a tokenizer from the transformers repo requires different files. In particular, the decode function is roughly implemented in the following way in transformers:
def decode(self, ids):
    # Map ids back to tokens, then strip the "##" continuation markers
    return " ".join([self.vocab[token_id] for token_id in ids]).replace(" ##", "")
In the tokenizers repo, the process is entirely different: merges.txt decides what should be merged, and vocab serves as a lookup table.
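To make the difference concrete, here is a minimal, self-contained sketch of the two decoding schemes (plain Python with toy vocabularies invented for illustration, not the real library code):

```python
# WordPiece-style (BERT / transformers): subwords carry a "##" prefix,
# so decoding is a join followed by stripping the continuation marker.
wordpiece_vocab = {0: "un", 1: "##friend", 2: "##ly"}

def wordpiece_decode(ids):
    return " ".join(wordpiece_vocab[i] for i in ids).replace(" ##", "")

# BPE-style (tokenizers repo): vocab is a plain lookup table, and a
# separate merges list records which symbol pairs were fused during
# training; merges matter for encoding, decoding just concatenates.
bpe_vocab = {0: "un", 1: "friend", 2: "ly"}
merges = [("un", "friend"), ("unfriend", "ly")]  # learned pair merges

def bpe_decode(ids):
    return "".join(bpe_vocab[i] for i in ids)

print(wordpiece_decode([0, 1, 2]))  # unfriendly
print(bpe_decode([0, 1, 2]))        # unfriendly
```

Both recover the same word, but only the WordPiece vocabulary encodes the subword boundary in the token strings themselves, which is why the two file formats are not interchangeable.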
My question remains: how can I convert one to the other?
Sorry, I was a bit too quick.
What is your problem with just training your custom tokenizer and using it?
I don’t understand what you mean by this:
The tokenizers provided by the transformers library are all supposed to be pretrained. For example, one issue is the way subwords are handled ("##" in the BERT tokenizer, which is handled differently in tokenizers).
It seems that this is exactly what you are looking for. Once trained, you can just load the tokenizer with tokenizers rather than with transformers.
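For reference, a sketch of that workflow, assuming the tokenizers and transformers packages are installed (the corpus and paths here are throwaway examples): train a WordPiece tokenizer with tokenizers, save its vocab.txt, and load that same file with transformers' BertTokenizer.

```python
import os
import tempfile

from tokenizers import BertWordPieceTokenizer  # WordPiece trainer from the tokenizers repo

# Tiny throwaway corpus just for the example
corpus = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(corpus, "w") as f:
    f.write("an unfriendly example sentence\n" * 100)

# Train a BERT-compatible WordPiece tokenizer from scratch
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=[corpus], vocab_size=1000, min_frequency=2)

# save_model writes vocab.txt -- the file format BertTokenizer expects
out_dir = tempfile.mkdtemp()
tokenizer.save_model(out_dir)
print(os.listdir(out_dir))  # ['vocab.txt']

# The same file loads directly into the transformers tokenizer
from transformers import BertTokenizer
bert_tok = BertTokenizer(vocab_file=os.path.join(out_dir, "vocab.txt"))
print(bert_tok.tokenize("unfriendly"))
```

Because BertWordPieceTokenizer produces a vocab.txt with the "##" convention, no conversion step is needed before handing the file to transformers.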
Oh, I thought that BertModel needed the specific tokenizer that was shipped with it. I did not try to use the other tokenizers (shame on me). I will try it and post the outcome here for future reference.
Thank you very much for your answer!