Check Vocabulary of a model

Is there a way to check whether a model knows a specific word?
I have a list of names, and for evaluation I would like to check whether the model knows any of these names.

I’m currently training bert-base-german-cased, which I want to check before and after training. After training, the names should be known, since they are included in the training set.

I tried the following code:

def check(name):
    # A name passes only if every whitespace-separated part
    # is a full entry in the tokenizer's vocabulary.
    for part in name.split():
        if part not in tokenizer.vocab:
            print(f'{name} not found')
            return False
    return True

passed = []
for name in names:
    if check(name):
        passed.append(name)
I don’t think this is a good solution though. Names like “Microsoft” or “Google” are being found while “Walmart” and “FedEx” are not.

bert-base-german-cased, like most models these days, uses subword tokenization. This means the tokenizer's vocabulary consists not just of the top-N most frequent words, but also of subwords (WordPiece or BPE tokens). This makes it possible to encode almost any string into a sequence of tokens, because the tokenizer keeps splitting a word until every piece is in the vocabulary.

So even though this tokenizer might not have a token “FedEx”, it can still encode it from subwords, for example “Fed” + “##Ex”, where “##” is a special prefix that says “this token continues the previous token rather than starting a new word”.
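The greedy longest-match splitting that WordPiece tokenizers perform can be sketched in plain Python. The vocabulary below is invented for illustration (a real BERT vocabulary has ~30k entries), but the matching logic mirrors the real algorithm: repeatedly take the longest prefix found in the vocabulary, marking continuation pieces with “##”:

```python
# Toy sketch of WordPiece-style greedy longest-match tokenization.
# TOY_VOCAB is invented for illustration; real vocabularies are much larger.
TOY_VOCAB = {"Fed", "##Ex", "Wal", "##mart", "Microsoft", "Google", "[UNK]"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Greedily split `word` into the longest subwords found in `vocab`."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker for non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return ["[UNK]"]  # no subword matches; the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

def knows_whole_word(word, vocab=TOY_VOCAB):
    """A word is a full vocabulary entry iff it tokenizes to one non-UNK token."""
    tokens = wordpiece_tokenize(word, vocab)
    return len(tokens) == 1 and tokens != ["[UNK]"]

print(wordpiece_tokenize("FedEx"))       # ['Fed', '##Ex']
print(wordpiece_tokenize("Microsoft"))   # ['Microsoft']
print(knows_whole_word("FedEx"))         # False
print(knows_whole_word("Microsoft"))     # True
```

With a real Hugging Face tokenizer the analogous check would be whether len(tokenizer.tokenize(name)) == 1, i.e. whether the name survives as a single token instead of being split into subwords.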

You can learn more about subword tokenizers in this video: Subword-based tokenizers - YouTube