I am looking to train a Transformer model on biological sequence data where a “sentence” may be represented as follows: [“A”, “LLGR”, “V”, “GD” …]
In order to do so, I need a word-level tokenizer that won't split my units any further, so I've opted for Transformer-XL, since its TransfoXLTokenizer operates at the word level.
The train() function from ByteLevelBPETokenizer() (as described in this blog post: https://huggingface.co/blog/how-to-train) is not available for TransfoXLTokenizer().
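For reference, this is the sort of train() call the blog post demonstrates, and what I'd like an equivalent of for TransfoXLTokenizer (the file name and parameter values below are just placeholders from my setup):

```python
from tokenizers import ByteLevelBPETokenizer

# Works for byte-level BPE (per the blog post), but I can't find
# an equivalent train() on TransfoXLTokenizer.
# "sequences.txt" is a placeholder for my corpus file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["sequences.txt"],
    vocab_size=1000,          # placeholder value
    min_frequency=2,
    special_tokens=["<unk>", "<eos>"],
)
```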
Is there a way to train TransfoXLTokenizer() on my custom “language”? Or do I simply train the TransfoXLModel and it'll take care of custom tokenization? (I can't see this being the case, as it's a completely unseen “language”.)
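In case it clarifies what I'm after: skimming tokenization_transfo_xl.py, I noticed count_sents() and build_vocab() helpers, so I'm wondering if the intended route is something like the sketch below. Those method names come from my own reading of the source rather than any documented usage, so I may well be misusing them:

```python
from transformers import TransfoXLTokenizer

# My guess from reading the source, not documented usage:
# count_sents() seems to expect a list of pre-tokenized sentences,
# and build_vocab() then freezes the counted symbols into a vocab.
tokenizer = TransfoXLTokenizer(unk_token="<unk>", eos_token="<eos>")

corpus = [
    ["A", "LLGR", "V", "GD"],   # my pre-split "sentences"
    ["GD", "A", "LLGR"],
]

tokenizer.count_sents(corpus)
tokenizer.build_vocab()

# Each unit should now map to a single id, with no further splitting
print(tokenizer.convert_tokens_to_ids(["A", "LLGR", "V", "GD"]))
```

Is this the intended way to build a custom vocabulary for it, or is there a proper training path I'm missing?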