Issue with extracting word_ids from a BatchEncoding object

I’m not sure if I’m doing something wrong, but for some reason when I try to extract the word_ids across my dataset, it only returns them for the first entry.

from datasets import load_dataset
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("dslim/bert-base-NER")
dataset = load_dataset("wnut_17")

# returns a BatchEncoding; input_ids has 1287 entries, each of length 512
tokenized_input = tokenizer(dataset["test"]["tokens"], padding="max_length", truncation=True, is_split_into_words=True)
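
To double-check the shapes, I printed the lengths (the numbers in the comments are what I see on my end):

# sanity check: 1287 sequences, each padded/truncated to 512 tokens
print(len(tokenized_input["input_ids"]))     # 1287
print(len(tokenized_input["input_ids"][0]))  # 512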

# returns a list of word ids of length 512; when I investigated, it's only
# returning the first entry from the tokenized input
word_ids = tokenized_input.word_ids()

# if I put it into a list comprehension it works as expected, returning a list
# of length 1287 where each element has length 512
word_ids = [tokenized_input[i].word_ids for i in range(len(tokenized_input["input_ids"]))]

Does anyone have any thoughts on what I might be doing wrong? In the example here, they use the same call as I do, and it works.


Having the same issue. Did you find out how to solve it?

Not really… I just ended up writing a list comprehension. It’s hacky, but it works:

word_ids = [
    tokenized_input[i].word_ids
    for i in range(len(tokenized_input["input_ids"]))
]
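
For what it’s worth, I think the reason the bare call only returns one entry is that `BatchEncoding.word_ids()` takes a `batch_index` argument that defaults to 0, so it maps tokens for a single sequence at a time. If that’s right, passing the index explicitly should be equivalent to the comprehension above:

# word_ids(batch_index=i) returns the token-to-word mapping for sequence i;
# with no argument it defaults to batch_index=0, i.e. the first sequence only
word_ids = [
    tokenized_input.word_ids(batch_index=i)
    for i in range(len(tokenized_input["input_ids"]))
]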
