Hello Hugging Face,
I'm trying to solve a token classification task where the documents are longer than the model's max length.
I modified the tokenize_and_align_labels function from the example token classification notebook: I set the tokenizer option return_overflowing_tokens=True
and rewrote the function to map the labels onto the overflowing tokens:
tokenizer_settings = {'is_split_into_words': True, 'return_offsets_mapping': True,
                      'padding': True, 'truncation': True, 'stride': 0,
                      'max_length': tokenizer.model_max_length, 'return_overflowing_tokens': True}

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], **tokenizer_settings)
    labels = []
    for i, document in enumerate(tokenized_inputs.encodings):
        doc_encoded_labels = []
        last_word_id = None
        for word_id in document.word_ids:
            if word_id is None:  # or last_word_id == word_id:
                # special tokens get the ignore index
                doc_encoded_labels.append(-100)
            else:
                # map the chunk back to its source document to look up the word-level label
                document_id = tokenized_inputs.overflow_to_sample_mapping[i]
                label = examples[task][document_id][word_id]
                doc_encoded_labels.append(int(label))
            last_word_id = word_id
        labels.append(doc_encoded_labels)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
Executing this code results in the following error:
exception has occurred: ArrowInvalid
Column 5 named task1 expected length 820 but got length 30
It looks like the 30 input examples can't be matched up with the 820 examples produced after the overflow splitting. How can I solve this issue?
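For context on where the two numbers come from: with return_overflowing_tokens=True the tokenizer returns one row per chunk instead of one row per document, so the columns my function writes ("input_ids", "labels", ...) end up with 820 rows while the untouched "task1" column still has 30. A minimal standalone sketch of that behaviour (placeholder checkpoint, synthetic data; the exact counts depend on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder, any fast tokenizer works

docs = [["word"] * 2000, ["word"] * 10]   # one document longer than max_length, one short
enc = tokenizer(docs, is_split_into_words=True, truncation=True,
                max_length=tokenizer.model_max_length, stride=0,
                return_overflowing_tokens=True, padding=True)

print(len(docs))                          # 2 input documents
print(len(enc["input_ids"]))              # more rows than documents: the long one is split into several chunks
print(enc["overflow_to_sample_mapping"])  # each chunk points back to its source document, e.g. [0, 0, 0, 0, 1]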
Environment info
Google Colab running this notebook
To reproduce
Steps to reproduce the behaviour:
- Replace the tokenize_and_align_labels function with the function given above.
- Add examples longer than max_length
- Run the tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True) cell (see the sketch below).
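Putting the steps together, the reproduction looks roughly like this (a sketch: the checkpoint and the dataset loading are placeholders for what the notebook and my data actually use):

from datasets import load_dataset
from transformers import AutoTokenizer

task = "task1"                                                 # my word-level label column
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")   # placeholder, any fast tokenizer

# placeholder loading; my dataset has "tokens" and "task1" columns and documents longer than max_length
datasets = load_dataset("json", data_files="my_long_documents.json")

# tokenize_and_align_labels is the function given above
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)
# -> ArrowInvalid: Column 5 named task1 expected length 820 but got length 30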