Train a new tokenizer from scratch

Hi,

I would like to train a tokenizer from scratch and use it with BERT. I would like to have a subword tokenizer (unigram, BPE, WordPiece) that generates the right files (special_tokens_map.json, tokenizer_config.json, added_tokens.json and vocab.txt). As far as I can tell, the tokenizers provided by the tokenizers library are not compatible with transformers.PreTrainedTokenizer (you cannot load the files created by one in the other). The tokenizers provided by the transformers library are all supposed to be pretrained. For example, one issue is the way the subwords are handled ("##" in BERT tokenizer, which is handled differently in tokenizers).

What is the right way to create and train a new tokenizer that I can use directly with BERT? And if this is definitely not possible, how can I convert one tokenizer to the other (or the files created by one to files that would be understood by the other)?

Thank you very much!

Have a look at the tokenizers repo.

I did; it is the essence of my question. A tokenizer from the tokenizers repo will generate tokenizer.json and merges.txt, while a tokenizer from the transformers repo will require different files. In particular, the decode function is roughly implemented in the following way in transformers.BertTokenizer:

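def decode(self, ids):
    # ids_to_tokens is the inverse of vocab (id -> token); " ##" marks a continuation piece
    tokens = [self.ids_to_tokens[token_id] for token_id in ids]
    return " ".join(tokens).replace(" ##", "").strip()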

In the tokenizers repo, the process is entirely different: merges.txt decides what should be merged, and the vocab serves as a lookup table.
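To make the contrast concrete, here is a minimal pure-Python sketch of the two mechanisms as I understand them. It is an illustration only, not code from either library: the function names wordpiece_decode and bpe_encode_word, the toy merge list, and the sample tokens are all made up for this example.

```python
def wordpiece_decode(tokens):
    """Join tokens with spaces, then glue '##' continuation pieces back on."""
    return " ".join(tokens).replace(" ##", "").strip()


def bpe_encode_word(word, merges):
    """Apply BPE merges to one word, always taking the highest-priority pair.

    `merges` is an ordered list of symbol pairs, playing the role of the
    lines in a merges.txt file (earlier entries have higher priority).
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical inputs, chosen only to show the behaviour.
decoded = wordpiece_decode(["token", "##izer", "from", "scratch"])
pieces = bpe_encode_word("lower", [("l", "o"), ("lo", "w"), ("e", "r")])
print(decoded)
print(pieces)
```

The point is that "##" is only a marker stored in the WordPiece vocabulary, whereas BPE needs the ordered merge list at encoding time, which is why the two sets of files look different.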

So my question is: how can I convert one to the other?

Sorry, I was a bit too quick.

What is your problem with just training your custom tokenizer and using it?

I don’t understand what you mean by this:

The tokenizers provided by the transformers library are all supposed to be pretrained. For example, one issue is the way the subwords are handled ("##" in BERT tokenizer, which is handled differently in tokenizers).

It seems that this is exactly what you are looking for. Once trained you can just load the tokenizer with tokenizers rather than with transformers.PreTrainedTokenizer.
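For future readers, a minimal sketch of that workflow might look like the following. It assumes a reasonably recent version of the tokenizers package (one that provides train_from_iterator and from_file on BertWordPieceTokenizer); the toy corpus, the vocabulary size, and the temporary output directory are placeholders chosen for illustration.

```python
import os
import tempfile

from tokenizers import BertWordPieceTokenizer

# Toy corpus; in practice this would be an iterator over your own text.
corpus = [
    "Training a tokenizer from scratch is straightforward.",
    "The tokenizer splits rare words into subword units.",
    "BERT marks word-internal pieces with a ## prefix.",
]

# Train a BERT-style WordPiece tokenizer ("##" continuation prefix).
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train_from_iterator(
    corpus,
    vocab_size=200,   # placeholder; real vocabularies are much larger
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)

# save_model writes the WordPiece vocabulary as vocab.txt in the directory.
out_dir = tempfile.mkdtemp()
tokenizer.save_model(out_dir)
vocab_path = os.path.join(out_dir, "vocab.txt")

# Reload from vocab.txt so that the [CLS]/[SEP] post-processing is set up.
reloaded = BertWordPieceTokenizer.from_file(vocab_path, lowercase=True)
encoding = reloaded.encode("Training a tokenizer")
print(encoding.tokens)
print(encoding.ids)
```

The resulting vocab.txt is in the one-token-per-line format that BERT tokenizers read, and encoding.ids is what you would feed to the model as input ids. Keep in mind that the model's embedding size has to match the new vocabulary size, so this is mainly relevant when you pretrain the model yourself.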

Oh, I thought that BertModel needed the specific tokenizer that was shipped with it. I did not try to use the other tokenizers (shame on me). I will try it and post the outcome here for future reference.

Thank you very much for your answer!