Question on byte pair encoding (chapter 6.5)

vivien · January 11, 2024, 8:13pm

Hi! After having read the page on Byte Pair Encoding, I thought I understood the idea but I’m confused by the list of merges of the Mistral-7B tokenizer… In https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/raw/main/tokenizer.json, I see that the first merges listed are:

 0: ▁ t
 1: i n
 2: e r
 3: ▁ a
 4: h e
 5: o n
 6: r e
 7: ▁ s
 8: e n
 9: a t
10: o r
11: ▁t he
12: ▁th e
13: ▁ the
14: e s
15: ▁ w
16: a n
17: ▁ c
18: i s
19: i t
20: o u
21: ▁ d
22: a l
23: a r
24: ▁ p
25: ▁ f
26: e d
27: ▁ b
28: in g
29: i ng

What I don’t understand is how merge #12 can ever be useful, because the ▁th token cannot have been created at this point in the encoding process (there is no ▁ th or ▁t h merge before #12). The same goes for #13 and #29. By my count, 10312 of the 58980 merges are in this situation.
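To make "in this situation" concrete, here is a minimal sketch of one way to flag such merges. It assumes each merge is a string of the form "left right" separated by a single space, and that every single-character token is part of the base vocabulary. The function name `unreachable_merges` is made up for illustration, and this is not necessarily how the count above was obtained. It is run only on the 30 merges listed above.

```python
# First 30 merges as listed above, each written as "left right".
MERGES = [
    "▁ t", "i n", "e r", "▁ a", "h e", "o n", "r e", "▁ s", "e n", "a t",
    "o r", "▁t he", "▁th e", "▁ the", "e s", "▁ w", "a n", "▁ c", "i s", "i t",
    "o u", "▁ d", "a l", "a r", "▁ p", "▁ f", "e d", "▁ b", "in g", "i ng",
]


def unreachable_merges(merges):
    """Return indices of merges whose left or right part does not exist yet.

    Assumption: a part exists if it is a single character (base vocabulary)
    or was produced by an earlier merge that was itself reachable.
    """
    available = set()  # multi-character tokens produced so far
    flagged = []
    for index, merge in enumerate(merges):
        left, right = merge.split(" ")
        reachable = all(
            len(part) == 1 or part in available for part in (left, right)
        )
        if reachable:
            available.add(left + right)
        else:
            flagged.append(index)
    return flagged


print(unreachable_merges(MERGES))  # [12, 13, 29]
```

A flagged merge does not add its result to `available`; that is a design choice of this sketch, and counting differently (for example, adding the result anyway) could give a different total on the full merge list.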

Can someone please help me understand?
