Hi! After reading the page on Byte Pair Encoding, I thought I understood the idea, but I’m confused by the merge list of the Mistral-7B tokenizer. In https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/raw/main/tokenizer.json, I see that the first merges listed are:
0: ▁ t
1: i n
2: e r
3: ▁ a
4: h e
5: o n
6: r e
7: ▁ s
8: e n
9: a t
10: o r
11: ▁t he
12: ▁th e
13: ▁ the
14: e s
15: ▁ w
16: a n
17: ▁ c
18: i s
19: i t
20: o u
21: ▁ d
22: a l
23: a r
24: ▁ p
25: ▁ f
26: e d
27: ▁ b
28: in g
29: i ng
What I don’t understand is how merge #12 can ever be useful, because the ▁th token cannot have been created at this point in the encoding process (there is no ▁ th or ▁t h merge before #12). The same goes for #13 and #29. By my count, 10312 merges (out of 58980 in total) are in this situation.
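To make the check concrete, here is a minimal sketch of how such a count can be made, run on the 30 merges quoted above rather than on the full file. It rests on two assumptions of mine: that the base vocabulary contains every single character, and that a merge can only fire if both of its parts already exist as tokens when the list is read in order. The helper name `find_unreachable_merges` is hypothetical and not part of any library.

```python
# The first 30 merges as listed above, one "left right" pair per line.
MERGES_TEXT = """▁ t
i n
e r
▁ a
h e
o n
r e
▁ s
e n
a t
o r
▁t he
▁th e
▁ the
e s
▁ w
a n
▁ c
i s
i t
o u
▁ d
a l
a r
▁ p
▁ f
e d
▁ b
in g
i ng"""


def find_unreachable_merges(merges, base_tokens):
    """Return indices of merges whose left or right part does not exist yet.

    Assumption: merges are considered strictly in listed order, and a token
    exists only if it is a base token or the product of an earlier merge
    that was itself reachable.
    """
    available = set(base_tokens)
    unreachable = []
    for index, (left, right) in enumerate(merges):
        if left in available and right in available:
            available.add(left + right)
        else:
            unreachable.append(index)
    return unreachable


# Each line holds exactly two parts separated by one space.
merges = [tuple(line.split(" ")) for line in MERGES_TEXT.splitlines()]

# Assumed base vocabulary: every single character seen in the merges.
base_tokens = {char for pair in merges for part in pair for char in part}

print(find_unreachable_merges(merges, base_tokens))  # prints [12, 13, 29]
```

On this excerpt the sketch flags exactly #12, #13 and #29, the three merges mentioned above. Running the same function over the full merge list from tokenizer.json should give the overall count under the same assumptions.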
Can someone please help me understand?