How do you use SentencePiece for BPE of sequences with no whitespace

I am trying to use byte pair encoding on amino acid sequences which have no spaces:

ADNRRPIWNLGHMVNALKQIPTFLXDGANA

The tokenizers summary section of the docs suggests SentencePiece could be useful, as it treats the input as a raw stream, includes the space in the set of characters to use, and then uses BPE or unigram to construct the appropriate vocabulary.

How would I train a tokenizer from scratch using SentencePiece? The tokenizers library seems to support only WordPiece.


In the original SentencePiece model, whitespace is treated as a regular character. Please read the description here.

I am not totally familiar with the Hugging Face implementation of SentencePiece, but you can train the model with the original sentencepiece library and then try loading that SentencePiece model through the Hugging Face wrapper if needed.
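
A minimal sketch of that approach, assuming your amino acid sequences are stored one per line in a plain-text file (the file name, model prefix, and vocabulary size below are just placeholders to adjust for your corpus):

```python
import sentencepiece as spm

# Train a BPE model directly on the raw sequences; since there is no
# whitespace, the vocabulary is learned from the amino acid characters alone.
spm.SentencePieceTrainer.train(
    input="sequences.txt",        # hypothetical file, one sequence per line
    model_prefix="protein_bpe",   # writes protein_bpe.model / protein_bpe.vocab
    model_type="bpe",             # "unigram" is the other option mentioned above
    vocab_size=1000,              # tune for your data
    character_coverage=1.0,       # keep rare symbols such as X in the vocabulary
)

# Load the trained model and tokenize a sequence.
sp = spm.SentencePieceProcessor(model_file="protein_bpe.model")
print(sp.encode("ADNRRPIWNLGHMVNALKQIPTFLXDGANA", out_type=str))
```

The resulting protein_bpe.model file is what you would then try to point a SentencePiece-backed Hugging Face tokenizer class at, as suggested above.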