Get intermediate tokens and merges used in tokenization

Hi Friends :wave:
Is there a way to get intermediate tokens and merges used during BPE tokenization?

Example:

  • vocab: `a`, `b`, `c`, `ab`, `bc`, `abc`
  • merges: `a b`, `b c`, `ab c`

What I want: `tokenize("abc")` should return `{"intermediate_tokens": ["a", "b", "ab", "c"], "intermediate_merges": ["a b"]}`

I currently solve this by manually implementing BPE in Python, but my implementation is too slow :sweat_smile:
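For context, my current workaround looks roughly like this minimal sketch (the function name `bpe_with_trace` is just made up, and it records every state including the final one, so it would need trimming to match the exact output above):

```python
def bpe_with_trace(word, merges):
    """Naive BPE merge loop that records every intermediate token
    sequence and every merge it applies."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = applied first
    tokens = list(word)                                  # start from single characters
    intermediate_tokens = list(tokens)
    intermediate_merges = []

    while len(tokens) > 1:
        # pick the adjacent pair with the best (lowest) merge rank
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge left
        intermediate_merges.append(" ".join(best))

        # apply the chosen merge left to right
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        intermediate_tokens.extend(tokens)

    return {
        "tokens": tokens,
        "intermediate_tokens": intermediate_tokens,
        "intermediate_merges": intermediate_merges,
    }


merges = [("a", "b"), ("b", "c"), ("ab", "c")]
print(bpe_with_trace("abc", merges))
# {'tokens': ['abc'],
#  'intermediate_tokens': ['a', 'b', 'c', 'ab', 'c', 'abc'],
#  'intermediate_merges': ['a b', 'ab c']}
```

This gives me the trace I'm after, but the pure-Python pair scan is exactly what makes it slow on real inputs, hence the question.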