I am very sorry that this example is confusing.
To give back some context, here we want to show with a very small example how the Unigram algorithm works.
This algorithm, starts with an initial vocabulary which is usually determined by a BPE algorithm. To avoid complicating the toy example here we wanted to take a simpler rule which is “take all strict substrings for the initial vocabulary”.
In concrete terms, we have listed all the strict substrings of the words in the corpus:
- the strict substrings of "hug" are ['h', 'u', 'g', 'hu', 'ug']
- the strict substrings of "pug" are ['p', 'u', 'g', 'pu', 'ug']
- the strict substrings of "pun" are ['p', 'u', 'n', 'pu', 'un']
- the strict substrings of "bun" are ['b', 'u', 'n', 'bu', 'un']
- the strict substrings of "hugs" are ['h', 'u', 'g', 's', 'hu', 'ug', 'gs', 'hug', 'ugs']

By merging these lists of strict substrings and deleting the duplicates, we end up with the initial vocabulary of ['n', 'b', 'g', 'u', 's', 'p', 'h', 'un', 'gs', 'hu', 'ug', 'bu', 'pu', 'ugs', 'hug'].
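The "take all strict substrings" rule can be sketched in a few lines of Python (the helper name `strict_substrings` is mine for illustration, not part of any library):

```python
def strict_substrings(word):
    """All contiguous substrings of `word` that are strictly shorter than `word`."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
        if j - i < len(word)
    }

corpus = ["hug", "pug", "pun", "bun", "hugs"]

# Merge the per-word substring sets; using a set removes duplicates automatically.
initial_vocab = set()
for word in corpus:
    initial_vocab |= strict_substrings(word)

print(sorted(initial_vocab))
```

Running this on the toy corpus reproduces the 15-token vocabulary listed above.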
Now that we have this initial vocabulary, we can forget about the notion of strict substrings and move on to the second part of the Unigram algorithm, which starts with the calculation of frequencies.
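To give a feel for what that frequency step looks like, here is a small sketch that counts how often each vocabulary token occurs as a substring of the corpus words (assuming, for simplicity, that each word appears exactly once; in the full example each word would be weighted by its own count):

```python
from collections import Counter

corpus = ["hug", "pug", "pun", "bun", "hugs"]
# Initial vocabulary obtained from the strict substrings above.
vocab = ['n', 'b', 'g', 'u', 's', 'p', 'h', 'un', 'gs', 'hu', 'ug', 'bu', 'pu', 'ugs', 'hug']

freqs = Counter()
for word in corpus:
    for token in vocab:
        # Count every occurrence of `token` as a substring of `word`.
        freqs[token] += sum(
            word[i:i + len(token)] == token
            for i in range(len(word) - len(token) + 1)
        )

print(freqs["u"], freqs["ug"], freqs["hug"])
```

For instance, 'u' occurs once in each of the five words, while 'ug' occurs in "hug", "pug", and "hugs".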
Does this make more sense?