I am very sorry that this example is confusing.
To give back some context, here we want to show with a very small example how the Unigram algorithm works.
This algorithm, starts with an initial vocabulary which is usually determined by a BPE algorithm. To avoid complicating the toy example here we wanted to take a simpler rule which is “take all strict substrings for the initial vocabulary”.
In concrete terms, we have listed all the strict substrings of the words in the corpus:
- the strict substrings of "hug" are ['h', 'u', 'g', 'hu', 'ug']
- the strict substrings of "pug" are ['p', 'u', 'g', 'pu', 'ug']
- the strict substrings of "pun" are ['p', 'u', 'n', 'pu', 'un']
- the strict substrings of "bun" are ['b', 'u', 'n', 'bu', 'un']
- the strict substrings of "hugs" are ['h', 'u', 'g', 's', 'hu', 'ug', 'gs', 'hug', 'ugs']

By merging these lists of strict substrings and deleting the duplicates, we end up with the initial vocabulary of ['n', 'b', 'g', 'u', 's', 'p', 'h', 'un', 'gs', 'hu', 'ug', 'bu', 'pu', 'ugs', 'hug'].
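The "take all strict substrings" rule can be sketched in a few lines of Python (the helper name `strict_substrings` is mine for illustration, not part of any library):

```python
def strict_substrings(word):
    """All contiguous substrings of `word` that are strictly shorter than `word`."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
        if j - i < len(word)
    }

corpus = ["hug", "pug", "pun", "bun", "hugs"]

# Merge the per-word substring sets; using a set removes duplicates automatically.
initial_vocab = set()
for word in corpus:
    initial_vocab |= strict_substrings(word)

print(sorted(initial_vocab))
```

Running this on the toy corpus reproduces the 15-token vocabulary listed above.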
Now that we have this initial vocabulary, we can forget about the notion of strict substrings and move on to the second part of the Unigram algorithm, which starts with the calculation of frequencies.
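To give a feel for what that frequency step looks like, here is a small sketch that counts how often each vocabulary token occurs as a substring of the corpus words (assuming, for simplicity, that each word appears exactly once; in the full example each word would be weighted by its own count):

```python
from collections import Counter

corpus = ["hug", "pug", "pun", "bun", "hugs"]
# Initial vocabulary obtained from the strict substrings above.
vocab = ['n', 'b', 'g', 'u', 's', 'p', 'h', 'un', 'gs', 'hu', 'ug', 'bu', 'pu', 'ugs', 'hug']

freqs = Counter()
for word in corpus:
    for token in vocab:
        # Count every occurrence of `token` as a substring of `word`.
        freqs[token] += sum(
            word[i:i + len(token)] == token
            for i in range(len(word) - len(token) + 1)
        )

print(freqs["u"], freqs["ug"], freqs["hug"])
```

For instance, 'u' occurs once in each of the five words, while 'ug' occurs in "hug", "pug", and "hugs".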
Does this make more sense?