Use a pretrained ByteLevelBPETokenizer on text

abdallah197 · July 17, 2020, 7:52am

Hi
Is there a simple way to tokenize a piece of text such as “I will go to the bedroom” into BPE tokens such as “I will go to the bed ##room”, without training a tokenizer from scratch?

anthony · July 17, 2020, 2:20pm

Hi @abdallah197!

There are many pretrained tokenizers in the huggingface/transformers library that you can use directly, without having to train anything. You won’t have any control over how the tokens are split, though: that depends on what the tokenizer learned during training and on the size of its vocabulary. “bedroom” isn’t really a rare word, so it will often have its own token in the vocabulary.

Your example looks like WordPiece rather than BPE, given the ## in ##room, which is specific to that kind of tokenizer. You can try the BERT tokenizers in the library to see if anything fits your needs, for example with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokenizer.tokenize("I will go to the bedroom")
# ['I', 'will', 'go', 'to', 'the', 'bedroom']

tokenizer.tokenize("I will go to the Bedroom")
# ['I', 'will', 'go', 'to', 'the', 'Bed', '##room']
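
To see why the capitalized word splits while the lowercase one does not, here is a minimal sketch of greedy longest-match-first splitting in the WordPiece style. It is only an illustration: `toy_vocab`, `wordpiece_tokenize` and `tokenize` are hypothetical names made up for this example, and the tiny vocabulary is an assumption, not BERT's real one.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word by repeatedly taking the longest piece found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Whitespace-split the text, then split each word into pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical toy vocabulary (an assumption for illustration only):
# "bedroom" has its own entry, "Bedroom" does not.
toy_vocab = {"I", "will", "go", "to", "the", "bedroom", "Bed", "##room"}

lower = tokenize("I will go to the bedroom", toy_vocab)
upper = tokenize("I will go to the Bedroom", toy_vocab)
print(lower)
print(upper)
```

With this toy vocabulary, “bedroom” is matched whole because it has its own entry, while “Bedroom” falls back to “Bed” plus “##room”. A pretrained tokenizer behaves the same way, except that its vocabulary was fixed during training, which is why you cannot choose the splits yourself.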
