toyl
December 28, 2021, 6:28am
1
Excuse me, I need to get a dictionary that maps each word to its index, so I can look up a word from its ID like this:
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
But with BERT I couldn't find the equivalent. I get the error: 'BertTokenizer' object has no attribute 'word_index'
BERT uses word-piece tokens, so if you are after the IDs associated with these word-piece tokens, you can find them in the tokenizer's vocabulary:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.vocab  # dictionary mapping each word-piece token to its ID
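To get back the ID-to-word lookup from the original question, one option is to invert the token-to-ID dictionary once and then look IDs up directly. The sketch below is only an illustration: `build_id_to_token` is a hypothetical helper name, and `toy_vocab` is a made-up stand-in for the dictionary returned by `tokenizer.get_vocab()`, so the IDs shown are not real BERT IDs.

```python
def build_id_to_token(vocab):
    """Invert a token -> id dictionary into an id -> token dictionary."""
    return {index: token for token, index in vocab.items()}


def word_for_id(integer, id_to_token):
    """Return the token for an ID, or None if the ID is unknown."""
    return id_to_token.get(integer)


# Toy stand-in for tokenizer.get_vocab(); these IDs are invented for illustration.
toy_vocab = {"[PAD]": 0, "hello": 1, "world": 2, "##ing": 3}

id_to_token = build_id_to_token(toy_vocab)
print(word_for_id(2, id_to_token))   # world
print(word_for_id(99, id_to_token))  # None
```

With a real tokenizer you would pass `tokenizer.get_vocab()` instead of `toy_vocab`. For a single lookup, `tokenizer.convert_ids_to_tokens(some_id)` should also work without building the inverse dictionary yourself.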
toyl
December 29, 2021, 11:18pm
3
Thanks a lot for replying. What is the difference between tokenizer.get_vocab() and tokenizer.vocab? Do both work the same way?
toyl
December 31, 2021, 4:25pm
4
Excuse me, do I need to load my own sentences to get the vocab, or is the vocab here the one from the pre-trained BERT?
From a very quick scan, tokenizer.vocab is an attribute and tokenizer.get_vocab() is a method. They both return a dictionary with the same number of items. I'm not sure why we need both.
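If you want to check this yourself rather than rely on a quick scan, a small comparison along these lines might help. It is a sketch under assumptions: `compare_vocab_views` is a hypothetical helper, and it only assumes that the object passed in exposes both a `vocab` attribute and a `get_vocab()` method, as the tokenizer above does.

```python
def compare_vocab_views(tokenizer):
    """Compare tokenizer.vocab with tokenizer.get_vocab() and report differences."""
    attr_vocab = dict(tokenizer.vocab)
    method_vocab = dict(tokenizer.get_vocab())
    return {
        "same_size": len(attr_vocab) == len(method_vocab),
        "same_items": attr_vocab == method_vocab,
        # Tokens that appear in one view but not the other, if any.
        "only_in_get_vocab": sorted(set(method_vocab) - set(attr_vocab)),
        "only_in_vocab": sorted(set(attr_vocab) - set(method_vocab)),
    }


# Usage, assuming `tokenizer` was loaded as shown earlier in the thread:
# print(compare_vocab_views(tokenizer))
```

If both views report the same items, either one can be used for lookups. If they differ for your tokenizer, the two "only_in" lists show exactly which tokens account for the difference.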