Train a new tokenizer from scratch

qmeeus · November 8, 2020, 3:06pm

Hi,

I would like to train a tokenizer from scratch and use it with BERT. I would like to have a subword tokenizer (unigram, BPE, WordPiece) that would generate the right files (special_tokens_map.json, tokenizer_config.json, added_tokens.json and vocab.txt). As far as I can tell, the tokenizers provided by the tokenizers library are not compatible with transformers.PreTrainedTokenizer (you cannot load the files created by one in the other). The tokenizers provided by the transformers library are all supposed to be pretrained. For example, one issue is the way the subwords are handled ("##" in BERT tokenizer, which is handled differently in tokenizers).

What is the right way to create and train a new tokenizer that I can use directly with BERT? And if this is definitely not possible, how can I convert one tokenizer to the other (or the files created by one to files that would be understood by the other)?

Thank you very much!

BramVanroy · November 8, 2020, 3:26pm

Have a look at the tokenizers repo.

qmeeus · November 8, 2020, 3:41pm

I did; it is the essence of my question. A tokenizer from the tokenizers repo will generate tokenizer.json and merges.txt, while a tokenizer from the transformers repo will require different files. In particular, the decode function is roughly implemented in the following way in transformers.BertTokenizer:

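def decode(self, ids):
    # Look up each id in the id-to-token table, join with spaces,
    # then glue "##" continuation pieces onto the preceding token.
    tokens = [self.ids_to_tokens[token_id] for token_id in ids]
    return " ".join(tokens).replace(" ##", "")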

In the tokenizers repo, the process is entirely different: merges.txt decides what should be merged, and the vocab serves as a lookup table.
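To make the contrast concrete, here is a minimal pure-Python sketch of the two conventions. It is not the code of either library: `wordpiece_decode`, `bpe_encode` and the `toy_merges` list are hypothetical names and data made up for illustration. It assumes a WordPiece-style vocabulary where "##" marks a continuation piece, and a BPE-style ranked merge list of the kind stored in merges.txt, which is applied when encoding a word.

```python
def wordpiece_decode(tokens):
    """Join WordPiece tokens; a "##" prefix marks a continuation of the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return " ".join(words)


def bpe_encode(word, merges):
    """Apply ranked BPE merges to one word, starting from single characters."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, ordered from highest to lowest priority.
toy_merges = [("i", "n"), ("in", "g"), ("p", "l"), ("pl", "a"), ("pla", "y")]

wordpiece_text = wordpiece_decode(["play", "##ing", "is", "fun"])
bpe_pieces = bpe_encode("playing", toy_merges)
print(wordpiece_text)  # playing is fun
print(bpe_pieces)      # ['play', 'ing']
```

The point of the sketch is only that the "##" marker belongs to the WordPiece model, while a merge list belongs to the BPE model; which files get written depends on the model you train, not on which library wraps it.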

So my question is: how can I convert one to the other?

BramVanroy · November 8, 2020, 7:55pm

Sorry, I was a bit too quick.

What is your problem with just training your custom tokenizer and using it?

I don’t understand what you mean by this:

The tokenizers provided by the transformers library are all supposed to be pretrained. For example, one issue is the way the subwords are handled (“##” in BERT tokenizer, which is handled differently in tokenizers).

It seems that the tokenizers library is exactly what you are looking for. Once trained, you can just load the tokenizer with tokenizers rather than with transformers.PreTrainedTokenizer.
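A minimal sketch of that workflow, assuming the tokenizers package is installed and using its `BertWordPieceTokenizer` class. The three-sentence corpus, the tiny `vocab_size` and the temporary output directory are placeholders chosen for illustration, not recommended settings.

```python
import os
import tempfile

from tokenizers import BertWordPieceTokenizer

# Placeholder corpus; in practice you would train on your own text.
corpus = [
    "Training a tokenizer from scratch is straightforward.",
    "The tokenizer splits rare words into subword pieces.",
    "BERT expects WordPiece pieces prefixed with ##.",
]

# Train a WordPiece tokenizer that uses the BERT special tokens.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train_from_iterator(
    corpus,
    vocab_size=200,   # toy value for a toy corpus
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)

# save_model writes the WordPiece vocabulary as vocab.txt in the directory.
out_dir = tempfile.mkdtemp()
saved_files = tokenizer.save_model(out_dir)
vocab_path = os.path.join(out_dir, "vocab.txt")

# Encode with the freshly trained tokenizer.
encoding = tokenizer.encode("Training tokenizers")
print(saved_files)
print(encoding.tokens)
```

A WordPiece model trained this way saves a vocab.txt with one token per line, which is the file that `transformers.BertTokenizer` and `BertTokenizerFast` accept through their `vocab_file` argument, so no conversion step should be needed for the BERT case.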

qmeeus · November 10, 2020, 3:51pm

Oh, I thought that BertModel needed the specific tokenizer that was shipped with it. I did not try to use the other tokenizers (shame on me). I will try it and post the outcome here for future reference.

Thank you very much for your answer!
