Hi Friends
Is there a way to get intermediate tokens and merges used during BPE tokenization?
Example:
- vocab: "a", "b", "c", "ab", "bc", "abc"
- merges: "a b", "b c", "ab c"
What I want: tokenize("abc") → {"intermediate_tokens": ["a", "b", "ab", "c"], "intermediate_merges": ["a b"]}
I currently solve this by manually implementing BPE in Python, but my implementation is too slow
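For reference, here is a minimal sketch of what I mean by a manual BPE with tracing (function name and output keys are just my own; it records every step, including the final merge, so the intermediate-only view above can be sliced out of it):

```python
def tokenize_with_trace(word, merges):
    """Apply BPE merges in priority order, recording each
    intermediate token sequence and the merge that produced it."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from individual characters
    trace_tokens = list(tokens)
    trace_merges = []
    while len(tokens) > 1:
        # pick the adjacent pair with the highest merge priority (lowest rank)
        best = min(
            (pair for pair in zip(tokens, tokens[1:]) if pair in ranks),
            key=ranks.get,
            default=None,
        )
        if best is None:
            break  # no more applicable merges
        # merge every occurrence of the best pair, left to right
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        trace_merges.append(" ".join(best))
        trace_tokens.extend(tokens)
    return {
        "tokens": tokens,
        "intermediate_tokens": trace_tokens,
        "intermediate_merges": trace_merges,
    }

result = tokenize_with_trace("abc", [("a", "b"), ("b", "c"), ("ab", "c")])
# result["tokens"] == ["abc"]
# result["intermediate_merges"] == ["a b", "ab c"]
```

This works, but it re-scans the pair list on every iteration, which is why it gets slow on long inputs compared to optimized tokenizer internals.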