Tokenizer unigram tutorial encode_word function question

Nevermetyou · May 11, 2024, 9:30am

Hi

I just finished reading this tutorial on Unigram.

I have a question about this line in the function encode_word():

best_segmentations = [{"start": 0, "score": 1}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]

As I understand it, the score in the first dictionary of this list should be log(1) = 0, not 1.

Because in this line
score = model[token] + best_score_at_start
we are summing log probabilities, so the empty prefix (probability 1) should contribute log(1) = 0.

So I suspect that [{"start": 0, "score": 1}] should be [{"start": 0, "score": 0}]
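
To check this, here is a minimal sketch of the same dynamic program with the starting score made into a parameter. It is not the tutorial's exact code: the init_score argument, the toy probabilities, and the "<unk>" fallback are my own assumptions, and I am assuming that model[token] stores the negative log probability and that the smallest total score wins.

```python
from math import log


def encode_word(word, model, init_score=0.0):
    # best_segmentations[i] describes the best segmentation of word[:i].
    # init_score is a hypothetical parameter added to compare 0 and 1.
    best_segmentations = [{"start": 0, "score": init_score}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    for start_idx in range(len(word)):
        best_score_at_start = best_segmentations[start_idx]["score"]
        if best_score_at_start is None:
            continue  # no known segmentation ends at this position
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token not in model:
                continue
            # Scores are assumed to be negative log probabilities, so they add up.
            score = model[token] + best_score_at_start
            current = best_segmentations[end_idx]["score"]
            if current is None or score < current:
                best_segmentations[end_idx] = {"start": start_idx, "score": score}

    final_score = best_segmentations[-1]["score"]
    if final_score is None:
        return ["<unk>"], None  # the word cannot be segmented with this vocabulary

    # Walk the back-pointers from the end of the word to recover the tokens.
    tokens = []
    end = len(word)
    while end > 0:
        start = best_segmentations[end]["start"]
        tokens.insert(0, word[start:end])
        end = start
    return tokens, final_score


# Toy probabilities, invented for illustration only.
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "hu": 0.2, "ug": 0.3, "hug": 0.2}
model = {token: -log(p) for token, p in probs.items()}

tokens_0, score_0 = encode_word("hug", model, init_score=0.0)
tokens_1, score_1 = encode_word("hug", model, init_score=1.0)
print(tokens_0, score_0)
print(tokens_1, score_1)
```

If those assumptions hold, every candidate segmentation gets the same constant added to it, so both starting values pick the same tokens and only the reported score shifts by 1. That would make 0, that is log(1), the consistent starting value.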

Can someone clarify this for me?

Thanks
