TransformerXL on Custom Language

I am looking to train a Transformer model on biological sequence data where a “sentence” may be represented as follows: [“A”, “LLGR”, “V”, “GD” …]

In order to do so, I need a word-level tokenizer that doesn’t split up words, so I’ve opted for TransformerXL.

The train() function from ByteLevelBPETokenizer() (as described in this blog: https://huggingface.co/blog/how-to-train) is not available for TransfoXLTokenizer().

Is there a way to train TransfoXLTokenizer() on my custom “language”? Or do I simply train the TransfoXLModel and it’ll take care of custom tokenization? (I can’t see this being the case, as it’s a completely unseen “language”.)

No need to train. TransfoXLTokenizer() is word-level: unlike ByteLevelBPETokenizer(), it doesn’t learn merges or split words into subwords, it just looks up each whitespace-separated word in its vocabulary.

Simply create your vocab file as a txt file with each word on a new line, and feed its path as a parameter to TransfoXLTokenizer():

from transformers import TransfoXLTokenizer

tokenizer = TransfoXLTokenizer(vocab_file=vocab_path)
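
For example, here’s a minimal end-to-end sketch for the biological “sentence” above (the residues_vocab.txt file name and the example vocabulary are my own assumptions, not anything prescribed; note that TransfoXLTokenizer expects an <unk> token to be present in the vocab file):

from transformers import TransfoXLTokenizer

# Hypothetical word-level vocabulary for the custom "language",
# plus the special tokens TransfoXLTokenizer expects.
words = ["<unk>", "<eos>", "A", "LLGR", "V", "GD"]

vocab_path = "residues_vocab.txt"  # assumed file name
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(words))  # one word per line, no trailing blank line

tokenizer = TransfoXLTokenizer(vocab_file=vocab_path)

# The tokenizer splits on whitespace and looks whole words up,
# so join your tokens with spaces before encoding.
print(tokenizer.tokenize("A LLGR V GD"))      # ['A', 'LLGR', 'V', 'GD']
print(tokenizer("A LLGR V GD")["input_ids"])  # [2, 3, 4, 5] with this vocab

Any word that isn’t in the vocab file gets mapped to <unk>, so make sure the file covers every token that can appear in your sequences.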