Hi there, I'm having trouble with the tokenizer's word_ids referring to word indices that do not exist in my label list. To elaborate, here's the code and the error message:
from transformers import AutoTokenizer

model_checkpoint = "bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenized_column_list = []
word_id_list = []
input_id_list = []
token_type_id_list = []
attention_mask_list = []

# Tokenize each row's raw text and keep the tokens, word ids and model inputs
for index, row in ta_train_df.iterrows():
    text = row["Raw text"].lower()
    inputs = tokenizer(text)
    tokenized_column_list.append(inputs.tokens())
    word_id_list.append(inputs.word_ids())
    input_id_list.append(inputs["input_ids"])
    token_type_id_list.append(inputs["token_type_ids"])
    attention_mask_list.append(inputs["attention_mask"])

ta_train_df["Tokenized_text"] = tokenized_column_list
ta_train_df["Word_ID"] = word_id_list
ta_train_df["input_ids"] = input_id_list
ta_train_df["token_type_ids"] = token_type_id_list
ta_train_df["attention_mask"] = attention_mask_list

label_to_numeric = {"O": 0, "-": -100, "B-rx": 1, ...}  # etc.

def convert_labels_to_numeric(labels):
    numeric_labels = []
    for label in labels:
        if label in label_to_numeric:
            numeric_labels.append(label_to_numeric[label])
        else:
            # Unknown tags are ignored by the loss
            numeric_labels.append(-100)
    return numeric_labels

ta_train_df["numeric_iob_tags"] = ta_train_df["iob_tags"].apply(convert_labels_to_numeric)

def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
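
The `label % 2 == 1` test above relies on a numbering convention that the mapping in this post only partly shows: each B- tag gets an odd id and its I- tag gets the next even id. Here is a minimal sketch of that assumption. Only `"O": 0` and `"B-rx": 1` come from the post; `I-rx`, `B-dx`, `I-dx` and their ids are hypothetical and only illustrate the convention.

```python
# Hypothetical mapping: only "O": 0 and "B-rx": 1 appear in the post.
# The other entries are assumed so that each B- id is odd and the
# matching I- id is the next even number.
toy_label_to_numeric = {"O": 0, "B-rx": 1, "I-rx": 2, "B-dx": 3, "I-dx": 4}
toy_numeric_to_label = {v: k for k, v in toy_label_to_numeric.items()}


def to_inside(label):
    """Turn a B- id into the matching I- id; leave every other id untouched."""
    if label % 2 == 1:
        return label + 1
    return label


for name in ["B-rx", "I-rx", "B-dx", "O"]:
    converted = to_inside(toy_label_to_numeric[name])
    print(name, "->", toy_numeric_to_label[converted])

# -100 % 2 is 0 in Python, so the ignore index passes through unchanged.
print(to_inside(-100))
```

If the real mapping does not follow this odd/even layout, the `+= 1` step would map a B- tag to an unrelated label, so it may be worth checking the full dictionary against this pattern.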

labels = ta_train_df["numeric_iob_tags"][0]
word_ids = ta_train_df["Word_ID"][0]
align_labels_with_tokens(labels, word_ids)

IndexError                                Traceback (most recent call last)
in <cell line: 6>()
      4 print(labels)
      5 print(word_ids)
----> 6 align_labels_with_tokens(labels, word_ids)

in align_labels_with_tokens(labels, word_ids)
      7             current_word = word_id
      8             print(current_word)
----> 9             label = -100 if word_id is None else labels[word_id]
     10             new_labels.append(label)
     11         elif word_id is None:

IndexError: list index out of range
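
A quick way to confirm this kind of mismatch before calling the alignment function is to compare the word indices against the number of labels. The sketch below is only a sanity check under the assumption that `labels` is a plain list and `word_ids` is the list returned by `inputs.word_ids()`. The helper name `check_word_ids` and the toy data are invented for illustration.

```python
def check_word_ids(labels, word_ids):
    """Return the word indices in word_ids that have no matching label."""
    n_labels = len(labels)
    # None marks special tokens such as [CLS] and [SEP]; those are skipped.
    return sorted({w for w in word_ids if w is not None and w >= n_labels})


# Toy data, invented for illustration: 3 labels, but word_ids reach index 4.
toy_labels = [0, 1, 2]
toy_word_ids = [None, 0, 1, 1, 2, 3, 4, None]
print(check_word_ids(toy_labels, toy_word_ids))  # [3, 4]
```

An empty result means every word index has a label. A non-empty result means the tokenizer counted more words than the label list contains.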

The values from inputs.word_ids() refer to word 356, whereas there are only 352 words in the labels. I'm not sure where the tokenizer is producing 4 extra words. The text in ta_train_df["Raw text"][0] is this:
'acute. confirmed CES. 1. DECOMPRESSION SURGERY
Decompression surgery should be performed as soon as possible. Although surgery within 48 hours of symptom onset has been used by some clinicians as a guide, this has been challenged, and remains controversial. It is likely that the level of neurologic dysfunction at the time of surgery (rather than time since symptom onset) is the most significant determinant of prognosis. One retrospective cohort study of 20,924 patients with CES reported that patients undergoing surgical decompression on hospital day 0 or 1 had better improved inpatient outcomes, including lower complication and mortality rates, than patients having surgery on day 2 or later. Evidence on the benefits of earlier surgery (e.g., within 24 hours) is equivocal. This may be due to differences in neurologic dysfunction among participants; some studies suggest that surgery within 24 hours of symptom onset may reduce postoperative bladder dysfunction in patients with incomplete CES, but not in patients with CES with urinary retention, compared with surgery between 24 and 48 hours.
Therefore, as the 48-hour time window is controversial, urgent surgery should not be delayed, especially since the precise time of symptom onset can be difficult to define. British Association of Spine Surgeons guidelines recommend that surgery should take place as soon as possible, while taking into account the duration and clinical course of symptoms and signs, as well as the potential for increased morbidity when operating at night.
The goal of surgery is to alleviate compression of the cauda equina, which may be achieved through a number of surgical techniques (e.g., wide-decompressive laminectomy, lumbar microdiskectomy). The appropriate surgical technique should be chosen based on pathology and the experience of the surgeon.
Intraoperative monitoring of somatosensory and motor-evoked potentials allows for evaluation of radiculopathy and neuropathy, but is not a necessary part of urgent procedures.'
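
One way to see where the extra words come from is to rebuild the words as the tokenizer counts them and compare them with the word list the labels were written for. The sketch below assumes WordPiece-style tokens (continuation pieces prefixed with `##`) and assumes the labels were made for whitespace-separated words. The post does not show how `iob_tags` was built, so that second assumption may not hold. The toy tokens and word ids are hand-written for illustration and are not real output from this tokenizer. Both helper names are hypothetical.

```python
def words_from_tokens(tokens, word_ids):
    """Group WordPiece tokens by word index and glue the pieces back together."""
    words = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] and [SEP]
        piece = token[2:] if token.startswith("##") else token
        words[word_id] = words.get(word_id, "") + piece
    return [words[i] for i in sorted(words)]


def first_mismatch(tokenizer_words, label_words):
    """Return (index, tokenizer word, label word) at the first difference, or None."""
    for i, (a, b) in enumerate(zip(tokenizer_words, label_words)):
        if a != b:
            return i, a, b
    return None


# Hand-written toy example: here the hyphenated word is counted as three words.
toy_tokens = ["[CLS]", "within", "48", "-", "hour", "de", "##com", "##pression", "[SEP]"]
toy_word_ids = [None, 0, 1, 2, 3, 4, 4, 4, None]
toy_label_words = "within 48-hour decompression".split()

tokenizer_words = words_from_tokens(toy_tokens, toy_word_ids)
print(tokenizer_words)
print(first_mismatch(tokenizer_words, toy_label_words))
```

Running the same comparison on `ta_train_df["Tokenized_text"][0]` and `ta_train_df["Word_ID"][0]` should point at the first place where the two word counts diverge. If the labels really are one per whitespace-separated word, one option that may be worth trying is to pass the pre-split word list to the tokenizer with `is_split_into_words=True`, so that the word ids index into that same list.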