I am asking whether there’s a simple way to tokenize a piece of text such as “I will go to the bedroom” into BPE tokens like “I will go to the bed ##room” without training a tokenizer from scratch.
There are a number of pre-trained tokenizers in the huggingface/transformers library that you can use directly, without having to train anything. You won’t have any control over how the tokens are split, though: that depends on what the tokenizer learned during training and on the size of its vocabulary.
“bedroom” isn’t really a rare word, so it will often have its own token in the vocabulary.
Your example looks like WordPiece rather than BPE, given the ##room piece: the ## prefix is specific to this kind of tokenizer. You can try the BERT tokenizers in the library to see if one fits your needs, for example with:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokenizer.tokenize("I will go to the bedroom")
# ['I', 'will', 'go', 'to', 'the', 'bedroom']

tokenizer.tokenize("I will go to the Bedroom")
# ['I', 'will', 'go', 'to', 'the', 'Bed', '##room']
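To see why the same word can come out whole or split, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece tokenizers apply at inference time. It is a simplified illustration, not the library's implementation: the function name wordpiece_tokenize and both toy vocabularies are made up for this example, and real BERT tokenizers also handle normalization, punctuation and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word, WordPiece style."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies, invented for illustration (not BERT's real vocabulary).
vocab_with_word = {"bed", "##room", "bedroom"}
vocab_without_word = {"bed", "##room"}

print(wordpiece_tokenize("bedroom", vocab_with_word))     # ['bedroom']
print(wordpiece_tokenize("bedroom", vocab_without_word))  # ['bed', '##room']
```

The only difference between the two calls is whether the full word is in the vocabulary, which is why you cannot force a particular split from a pre-trained tokenizer without changing its vocabulary.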