I am looking to train a Transformer model on biological sequence data where a “sentence” may be represented as follows: [“A”, “LLGR”, “V”, “GD” …]
In order to do so, I need a word-level tokenizer that doesn’t split up words, so I’ve opted for TransformerXL.
However, the train() function from ByteLevelBPETokenizer() (as described in this blog post: https://huggingface.co/blog/how-to-train) is not available for TransfoXLTokenizer(). Is there a way to train TransfoXLTokenizer() on my custom “language”? Or do I simply train the TransfoXLModel and it takes care of custom tokenization? (I can’t see this being the case, as it’s a completely unseen “language”.)
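For reference, this is roughly the behaviour I’m after, sketched with the word-level model from the `tokenizers` library (the corpus and the `[UNK]` special token here are just placeholders, not my real data):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import WhitespaceSplit

# Toy corpus: each "sentence" is a space-separated sequence of
# biological "words" (placeholder data, not my real dataset).
corpus = ["A LLGR V GD", "GD A A LLGR", "V V GD LLGR A"]

# Word-level model: every whitespace-delimited token becomes one
# vocabulary entry, so words are never split into sub-units.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()

trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("A LLGR V GD")
print(encoding.tokens)  # each word kept whole, never split
```

So I can get whole-word tokenization this way, but I don’t see how to plug such a tokenizer into TransfoXLTokenizer() / TransfoXLModel.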