Issue with extracting word_ids from a BatchEncoding object

I’m not sure if I’m doing something wrong, but for some reason when I try to extract the word_ids across my dataset, it only returns them for the first entry.

from datasets import load_dataset
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("dslim/bert-base-NER")
dataset = load_dataset("wnut_17")

# returns a BatchEncoding; input_ids has 1287 entries, each of length 512
tokenized_input = tokenizer(dataset["test"]["tokens"], padding="max_length", truncation=True, is_split_into_words=True)
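
To double-check the shapes, I printed the lengths (the numbers in the comments are what I see on my end):

# sanity check: 1287 sequences, each padded/truncated to 512 tokens
print(len(tokenized_input["input_ids"]))     # 1287
print(len(tokenized_input["input_ids"][0]))  # 512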

# returns a list of word ids of length 512; when I investigated, it's only
# returning the first entry from the tokenized input
word_ids = tokenized_input.word_ids()

# if I put it into a list comprehension it works as expected, returning a list
# of length 1287 where each element has length 512
word_ids = [tokenized_input[i].word_ids for i in range(len(tokenized_input["input_ids"]))]

Does anyone have any thoughts on what I might be doing wrong? In the example here, they use the same call as I do, and it works.


Having the same issue. Did you find out how to solve it?

Not really… I just ended up writing a list comprehension. It’s hacky, but it works:

word_ids = [
    tokenized_input[i].word_ids
    for i in range(len(tokenized_input["input_ids"]))
]
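
For what it’s worth, I think the reason the bare call only returns one entry is that `BatchEncoding.word_ids()` takes a `batch_index` argument that defaults to 0, so it maps tokens for a single sequence at a time. If that’s right, passing the index explicitly should be equivalent to the comprehension above:

# word_ids(batch_index=i) returns the token-to-word mapping for sequence i;
# with no argument it defaults to batch_index=0, i.e. the first sequence only
word_ids = [
    tokenized_input.word_ids(batch_index=i)
    for i in range(len(tokenized_input["input_ids"]))
]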
