Get intermediate tokens and merges used in tokenization

Hi Friends :wave:
Is there a way to get intermediate tokens and merges used during BPE tokenization?

Example:

  • vocab: `a`, `b`, `c`, `ab`, `bc`, `abc`
  • merges: `a b`, `b c`, `ab c`

What I want: `tokenize("abc")` should return `{"intermediate_tokens": ["a", "b", "ab", "c"], "intermediate_merges": ["a b"]}`

I currently solve this by manually implementing BPE in Python, but my implementation is too slow :sweat_smile:
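For context, my current workaround looks roughly like this minimal sketch (the function name `bpe_with_trace` is just made up, and it records every state including the final one, so it would need trimming to match the exact output above):

```python
def bpe_with_trace(word, merges):
    """Naive BPE merge loop that records every intermediate token
    sequence and every merge it applies."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = applied first
    tokens = list(word)                                  # start from single characters
    intermediate_tokens = list(tokens)
    intermediate_merges = []

    while len(tokens) > 1:
        # pick the adjacent pair with the best (lowest) merge rank
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge left
        intermediate_merges.append(" ".join(best))

        # apply the chosen merge left to right
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        intermediate_tokens.extend(tokens)

    return {
        "tokens": tokens,
        "intermediate_tokens": intermediate_tokens,
        "intermediate_merges": intermediate_merges,
    }


merges = [("a", "b"), ("b", "c"), ("ab", "c")]
print(bpe_with_trace("abc", merges))
# {'tokens': ['abc'],
#  'intermediate_tokens': ['a', 'b', 'c', 'ab', 'c', 'abc'],
#  'intermediate_merges': ['a b', 'ab c']}
```

This gives me the trace I'm after, but the pure-Python pair scan is exactly what makes it slow on real inputs, hence the question.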