Inputs.word_ids() length not matching word label length

Hi there, I'm having trouble with the tokenizer's word_ids() referring to word indices that don't exist in my labels. To elaborate, here's the code and the error message:

from transformers import AutoTokenizer

model_checkpoint = "bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenized_column_list = []
word_id_list = []
input_id_list = []
token_type_id_list = []
attention_mask_list = []

for index, row in ta_train_df.iterrows():
    text = row['Raw text'].lower()
    inputs = tokenizer(text)
    # Collect each encoding's fields so they can be stored as columns below
    tokenized_column_list.append(inputs.tokens())
    word_id_list.append(inputs.word_ids())
    input_id_list.append(inputs['input_ids'])
    token_type_id_list.append(inputs['token_type_ids'])
    attention_mask_list.append(inputs['attention_mask'])

ta_train_df['Tokenized_text'] = tokenized_column_list
ta_train_df['Word_ID'] = word_id_list
ta_train_df['input_ids'] = input_id_list
ta_train_df['token_type_ids'] = token_type_id_list
ta_train_df['attention_mask'] = attention_mask_list

label_to_numeric = {'O': 0, '-': -100, 'B-rx': 1, …}  # etc.

def convert_labels_to_numeric(labels):
    numeric_labels = []
    for label in labels:
        if label in label_to_numeric:
            numeric_labels.append(label_to_numeric[label])
    return numeric_labels

ta_train_df['numeric_iob_tags'] = ta_train_df['iob_tags'].apply(convert_labels_to_numeric)

def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels

labels = ta_train_df['numeric_iob_tags'][0]
word_ids = ta_train_df['Word_ID'][0]

align_labels_with_tokens(labels, word_ids)

IndexError Traceback (most recent call last)
in <cell line: 6>()
4 print(labels)
5 print(word_ids)
----> 6 align_labels_with_tokens(labels, word_ids)

in align_labels_with_tokens(labels, word_ids)
7 current_word = word_id
8 print(current_word)
----> 9 label = -100 if word_id is None else labels[word_id]
10 new_labels.append(label)
11 elif word_id is None:

IndexError: list index out of range
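For anyone hitting the same traceback, the failure mode is easy to reproduce with toy data (a pure-Python sketch; the real `word_ids` and labels come from the DataFrame columns above):

```python
# Toy stand-ins for inputs.word_ids() and the per-word labels.
# None marks special tokens ([CLS]/[SEP]); repeated ids are subword pieces.
word_ids = [None, 0, 1, 1, 2, 3, None]
labels = [0, 1, 2]  # deliberately one label short

# Largest word index the tokenizer produced
max_word = max(w for w in word_ids if w is not None)

print(max_word, len(labels))  # prints "3 3": labels[3] would raise IndexError
```

Whenever the largest non-None word_id is >= len(labels), `labels[word_id]` is guaranteed to fail somewhere in the loop.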

The inputs.word_ids() are referring to word 356, whereas there are only 352 words in the labels. I'm not sure where the tokenizer is producing 4 extra words. The text in ta_train_df['Raw text'][0] is this:

'acute. confirmed CES. 1. DECOMPRESSION SURGERY

Decompression surgery should be performed as soon as possible. Although surgery within 48 hours of symptom onset has been used by some clinicians as a guide, this has been challenged, and remains controversial. It is likely that the level of neurologic dysfunction at the time of surgery (rather than time since symptom onset) is the most significant determinant of prognosis. One retrospective cohort study of 20,924 patients with CES reported that patients undergoing surgical decompression on hospital day 0 or 1 had better improved inpatient outcomes, including lower complication and mortality rates, than patients having surgery on day 2 or later. Evidence on the benefits of earlier surgery (e.g., within 24 hours) is equivocal. This may be due to differences in neurologic dysfunction among participants; some studies suggest that surgery within 24 hours of symptom onset may reduce postoperative bladder dysfunction in patients with incomplete CES, but not in patients with CES with urinary retention, compared with surgery between 24 and 48 hours.

Therefore, as the 48-hour time window is controversial, urgent surgery should not be delayed, especially since the precise time of symptom onset can be difficult to define. British Association of Spine Surgeons guidelines recommend that surgery should take place as soon as possible, while taking into account the duration and clinical course of symptoms and signs, as well as the potential for increased morbidity when operating at night.

The goal of surgery is to alleviate compression of the cauda equina, which may be achieved through a number of surgical techniques (e.g., wide-decompressive laminectomy, lumbar microdiskectomy). The appropriate surgical technique should be chosen based on pathology and the experience of the surgeon.

Intraoperative monitoring of somatosensory and motor-evoked potentials allows for evaluation of radiculopathy and neuropathy, but is not a necessary part of urgent procedures.'

I think I've figured this out. It's due to special characters in the raw text, such as '~~~', '{{', '}}', '|', '20,357', etc. It turns out my text labels and the tokenizer handle these cases differently: the tokenizer splits them off and counts them as separate words, but my labels don't.
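To make the mismatch concrete, here is a small sketch using `re` to mimic the punctuation splitting that a BERT-style pre-tokenizer does (this regex is an approximation for illustration, not the actual tokenizer):

```python
import re

text = "surgery on 20,924 patients {{today}}"

# Labels were built from whitespace splitting: 5 words
whitespace_words = text.split()

# A BERT-style pre-tokenizer also splits on punctuation, so ',', '{'
# and '}' each become their own word (rough regex approximation)
pretokenized_words = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_words), len(pretokenized_words))  # prints "5 11"
```

Every comma-in-a-number or brace run therefore shifts the tokenizer's word count further away from the whitespace-based label count.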


Have you been able to solve this problem? Did you write a new function, or create a new tokenizer?

Hi, I didn't create a new function or tokenizer; I just cleaned my raw text a bit better to get rid of the special characters.
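For reference, the cleanup can be as simple as stripping the offending characters before tokenizing. A minimal sketch (the character set here is just an example based on the ones mentioned above; adjust it to whatever appears in your data):

```python
import re

def clean_text(text):
    # Replace markup-like characters the tokenizer would split off
    # as extra "words" (hypothetical set; extend as needed)
    text = re.sub(r"[~{}|]+", " ", text)
    # Collapse the extra whitespace left behind
    return " ".join(text.split())

print(clean_text("confirmed CES {{acute}} ~~~ grade | III"))
# prints "confirmed CES acute grade III"
```

The key point is to do this cleaning *before* producing the word-level labels, so labels and `word_ids()` are derived from the same word boundaries.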