Hi! After reading the page on Byte Pair Encoding, I thought I understood the idea, but I’m confused by the list of merges in the Mistral-7B tokenizer. In https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/raw/main/tokenizer.json, the first merges listed are:
 0: ▁ t
 1: i n
 2: e r
 3: ▁ a
 4: h e
 5: o n
 6: r e
 7: ▁ s
 8: e n
 9: a t
10: o r
11: ▁t he
12: ▁th e
13: ▁ the
14: e s
15: ▁ w
16: a n
17: ▁ c
18: i s
19: i t
20: o u
21: ▁ d
22: a l
23: a r
24: ▁ p
25: ▁ f
26: e d
27: ▁ b
28: in g
29: i ng
What I don’t understand is how merge #12 can ever be useful: the ▁th token cannot have been created at this point of the encoding process, because there is no "▁ th" or "▁t h" merge before #12. The same goes for #13 (no earlier merge produces "the") and #29 (no earlier merge produces "ng"). By my count, 10312 of the 58980 merges are in this situation.
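For reference, a check of this kind can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not necessarily the exact procedure behind the count above: the names `find_unreachable_merges` and `SAMPLE_MERGES` are made up for illustration, every single-character token is assumed to be in the base vocabulary, and a merge is flagged when one of its multi-character parts was not produced by an earlier usable merge.

```python
# First 30 merges from the list above, one "left right" pair per entry.
SAMPLE_MERGES = [
    "▁ t", "i n", "e r", "▁ a", "h e", "o n", "r e", "▁ s", "e n", "a t",
    "o r", "▁t he", "▁th e", "▁ the", "e s", "▁ w", "a n", "▁ c", "i s", "i t",
    "o u", "▁ d", "a l", "a r", "▁ p", "▁ f", "e d", "▁ b", "in g", "i ng",
]


def find_unreachable_merges(merges):
    """Return the ranks of merges whose parts cannot exist yet.

    Assumption: any single-character part is a base token. A longer part
    must have been produced by an earlier merge that was itself usable.
    """
    produced = set()      # multi-character tokens created so far
    unreachable = []      # ranks of merges that can never fire
    for rank, entry in enumerate(merges):
        left, right = entry.split(" ")
        parts_exist = all(len(p) == 1 or p in produced for p in (left, right))
        if parts_exist:
            produced.add(left + right)
        else:
            unreachable.append(rank)
    return unreachable


print(find_unreachable_merges(SAMPLE_MERGES))  # → [12, 13, 29]
```

To run it on the whole file, the merge list can be read with `json.load` from the `"merges"` entry under `"model"` in tokenizer.json. The result depends on the assumptions above, so it may not reproduce the 10312 figure exactly.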
Can someone please help me understand?