How do you use SentencePiece for BPE of sequences with no whitespace

I am trying to use byte pair encoding on amino acid sequences which have no spaces:

ADNRRPIWNLGHMVNALKQIPTFLXDGANA

The tokenizers summary section of the docs suggests SentencePiece could be useful, as it treats the input as a raw stream, includes the space in the set of characters to use, and then uses BPE or unigram to construct the appropriate vocabulary.

How would I train a tokenizer from scratch using SentencePiece? The tokenizers library seems to support only WordPiece.


In the original SentencePiece model, whitespace is treated as a regular character. Please read the description here.

I am not totally familiar with the Hugging Face implementation of SentencePiece, but you can train the model with the original sentencepiece library and then try loading that SentencePiece model through the Hugging Face wrapper if needed.
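
A minimal sketch of that approach, assuming your amino acid sequences are stored one per line in a plain-text file (the file name, model prefix, and vocabulary size below are just placeholders to adjust for your corpus):

```python
import sentencepiece as spm

# Train a BPE model directly on the raw sequences; since there is no
# whitespace, the vocabulary is learned from the amino acid characters alone.
spm.SentencePieceTrainer.train(
    input="sequences.txt",        # hypothetical file, one sequence per line
    model_prefix="protein_bpe",   # writes protein_bpe.model / protein_bpe.vocab
    model_type="bpe",             # "unigram" is the other option mentioned above
    vocab_size=1000,              # tune for your data
    character_coverage=1.0,       # keep rare symbols such as X in the vocabulary
)

# Load the trained model and tokenize a sequence.
sp = spm.SentencePieceProcessor(model_file="protein_bpe.model")
print(sp.encode("ADNRRPIWNLGHMVNALKQIPTFLXDGANA", out_type=str))
```

The resulting protein_bpe.model file is what you would then try to point a SentencePiece-backed Hugging Face tokenizer class at, as suggested above.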