Find the equivalent for word_index in BERT?

toyl · December 28, 2021, 6:28am

Excuse me, I need to get something like a dictionary that contains each word with its index, like this:

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

But when using BERT I couldn't find the equivalent, as I got: 'BertTokenizer' object has no attribute 'word_index'

jon-fernandes · December 29, 2021, 7:07pm

BERT uses word-piece tokens, so if you are after the IDs associated with these word-piece tokens, you can find them in the tokenizer's vocabulary.

from transformers import AutoTokenizer  # the tokenizer class, not AutoModel

checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.vocab  # dict mapping each word-piece token to its ID
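
To tie this back to the original word_for_id helper, here is a minimal sketch of the reverse lookup (ID to token). It assumes the vocabulary is a plain token-to-ID dictionary, as tokenizer.get_vocab() returns. The names toy_vocab and build_id_to_token are hypothetical, and the toy IDs are made up so the snippet runs without downloading a checkpoint.

```python
# Toy stand-in for the dict returned by tokenizer.get_vocab() (token -> ID).
# These IDs are invented for illustration; they are NOT real bert-base-uncased IDs.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3, "##s": 4}


def build_id_to_token(vocab):
    """Invert a token -> ID mapping into an ID -> token mapping (built once)."""
    return {index: token for token, index in vocab.items()}


def word_for_id(integer, id_to_token):
    """Return the token for an ID, or None if the ID is not in the vocabulary."""
    return id_to_token.get(integer)


id_to_token = build_id_to_token(toy_vocab)
print(word_for_id(3, id_to_token))   # world
print(word_for_id(99, id_to_token))  # None
```

With a real tokenizer you would pass tokenizer.get_vocab() instead of toy_vocab. The tokenizer also offers tokenizer.convert_ids_to_tokens(), which does this lookup directly, so building the inverted dictionary is only worthwhile if you want the whole mapping at hand.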

toyl · December 29, 2021, 11:18pm

Thanks a lot for replying. Excuse me, what is the difference between tokenizer.get_vocab() and tokenizer.vocab? Do both work the same way?

toyl · December 31, 2021, 4:25pm

Excuse me, do I need to load my own sentences to get the vocab, or is the vocab here the one from the pre-trained BERT?

jon-fernandes · December 31, 2021, 9:52pm

From a very quick scan, tokenizer.vocab is an attribute and tokenizer.get_vocab() is a method. They both return a dictionary with the same number of items. I'm not sure why we need both.
