Check Vocabulary of a model

Is there a way to check whether a model knows a specific word?
I have a list of names, and for evaluation I would like to check whether the model knows any of these names.

I’m currently training bert-base-german-cased, which I want to check before and after training. After training, the names should be known, since they are included in the training set.

I tried the following code:

def check(name):
    # A name passes only if every whitespace-separated part
    # is a full entry in the tokenizer's vocabulary.
    for part in name.split():
        if part not in tokenizer.vocab:
            print(f'{name} not found')
            return False
    return True

passed = []
for name in names:
    if check(name):
        passed.append(name)
I don’t think this is a good solution though. Names like “Microsoft” or “Google” are being found while “Walmart” and “FedEx” are not.

bert-base-german-cased, like most models these days, uses subword tokenization. This means the tokenizer's vocabulary consists not just of the top-N most frequent words, but also of subwords (WordPiece or BPE tokens). This makes it possible to encode almost any string into a sequence of tokens, because the tokenizer keeps splitting a word until every piece is in the vocabulary.

So even though this tokenizer might not have a token “FedEx”, it can still encode it from subwords, for example “Fed” + “##Ex”, where “##” is a special prefix that says “this token continues the previous token rather than starting a new word”.
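The greedy longest-match splitting that WordPiece tokenizers perform can be sketched in plain Python. The vocabulary below is invented for illustration (a real BERT vocabulary has ~30k entries), but the matching logic mirrors the real algorithm: repeatedly take the longest prefix found in the vocabulary, marking continuation pieces with “##”:

```python
# Toy sketch of WordPiece-style greedy longest-match tokenization.
# TOY_VOCAB is invented for illustration; real vocabularies are much larger.
TOY_VOCAB = {"Fed", "##Ex", "Wal", "##mart", "Microsoft", "Google", "[UNK]"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Greedily split `word` into the longest subwords found in `vocab`."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker for non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return ["[UNK]"]  # no subword matches; the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

def knows_whole_word(word, vocab=TOY_VOCAB):
    """A word is a full vocabulary entry iff it tokenizes to one non-UNK token."""
    tokens = wordpiece_tokenize(word, vocab)
    return len(tokens) == 1 and tokens != ["[UNK]"]

print(wordpiece_tokenize("FedEx"))       # ['Fed', '##Ex']
print(wordpiece_tokenize("Microsoft"))   # ['Microsoft']
print(knows_whole_word("FedEx"))         # False
print(knows_whole_word("Microsoft"))     # True
```

With a real Hugging Face tokenizer the analogous check would be whether len(tokenizer.tokenize(name)) == 1, i.e. whether the name survives as a single token instead of being split into subwords.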

You can learn more about subword tokenizers in this video: Subword-based tokenizers - YouTube