I am trying to use byte pair encoding (BPE) on amino acid sequences, which contain no spaces:
ADNRRPIWNLGHMVNALKQIPTFLXDGANA
The tokenizers summary section of the docs suggests SentencePiece could be useful, as it treats the input as a raw stream, includes the space in the set of characters to use, and then uses BPE or unigram to construct the appropriate vocabulary.
How would I train a tokenizer from scratch using SentencePiece? The tokenizers library seems to only support WordPiece.
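
To make the goal concrete, here is a toy sketch of the behavior I am after: BPE merges learned directly over the raw character stream, with no whitespace pre-tokenization. This is plain Python, not SentencePiece or any library API; the function names (`train_bpe`, `merge_pair`, `encode`) are made up for illustration, and ties between equally frequent pairs are broken by first-seen order, which is an arbitrary choice on my part.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` BPE merges over raw sequences (no word splitting)."""
    # Every sequence starts as a list of single-character symbols.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Pick the most frequent pair and merge it everywhere.
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(seq, merges):
    """Tokenize a sequence by applying the learned merges in training order."""
    symbols = list(seq)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    seq = "ADNRRPIWNLGHMVNALKQIPTFLXDGANA"
    merges = train_bpe([seq], num_merges=5)
    print(encode(seq, merges))
```

With a real corpus, `sequences` would be the full list of protein sequences and `num_merges` would control the final vocabulary size (number of distinct residues plus number of merges). I would like the SentencePiece equivalent of this.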