I am trying to train BERT for token classification, and I want to use the full text, splitting it into two samples if necessary. For that I am using return_overflowing_tokens with a specific stride. I want to tokenize the labels as well for this. I have seen the tokenize_and_align_labels function in the tutorial (Token classification), but it doesn't take care of such overflow. Is there something already available for this? Or should I generate the sample without truncation and then split it to a specific length later on?
@Sajan Hey, I'm also working on the same issue. I tried the same approach and am getting the same error: input_ids has a different length due to the sliding-window chunks, but the labels still have the same length as the dataset. Please let me know if the issue has been resolved and how you tackled it.
ArrowInvalid: Column 2 named input_ids expected length 2 but got length 4
Hi @Raisa06 The error you are getting (ArrowInvalid: Column 2 named input_ids expected length 2 but got length 4) seems different.
For this issue, the BERT fast tokenizer provides a word_ids function, which gives the original index of the pre-tokenized words and works with return_overflowing_tokens (e.g. tokenized_encoding.word_ids(2) gives the original word indices for the third chunk, which can be mapped to labels). I then loop over the tokenized word ids and add the corresponding label to the final labels list.
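A minimal sketch of that alignment loop, assuming a toy pre-tokenized example with made-up word-level labels (the checkpoint name and the tiny max_length are illustrative choices, chosen only to force overflow):

```python
from transformers import AutoTokenizer

# Hypothetical pre-tokenized example with word-level labels.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
words = ["John", "lives", "in", "New", "York"]
word_labels = [1, 0, 0, 2, 2]

encoding = tokenizer(
    words,
    is_split_into_words=True,
    truncation=True,
    max_length=6,   # tiny on purpose so this example actually overflows
    stride=2,
    return_overflowing_tokens=True,
)

# Build one label list per overflowing chunk: word_ids(i) maps every token
# of chunk i back to its original word index (None for special tokens).
all_labels = []
for i in range(len(encoding["input_ids"])):
    chunk_labels = [
        -100 if word_id is None else word_labels[word_id]  # -100 is ignored by the loss
        for word_id in encoding.word_ids(i)
    ]
    all_labels.append(chunk_labels)
```

After this loop, all_labels has one entry per chunk in encoding["input_ids"], each exactly as long as its chunk.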
Hi @Sajan Thanks for your reply, I get it now. I've implemented the sliding windows and mapped labels to each window. For example:
- len(example["words"]) is 2, i.e. two lists of tokens, each longer than the maximum sequence length.
- Introduced sliding windows during tokenization:
  tokenized_inputs = tokenizer(examples["words"], is_split_into_words=True, truncation=True, padding="max_length", max_length=500, stride=200, return_overflowing_tokens=True, return_offsets_mapping=True)
- So tokenized_inputs["input_ids"] will have, let's say, 4 lists, given the sliding-window configuration I've set.
- I've mapped input_ids to word_ids and then to labels manually (Hugging Face doesn't have any function to align labels with the sliding windows it generates),
- which results in tokenized_inputs["labels"] with 4 lists.
tokenized_dataset = datasets.map(tokenize_and_align_labels, batched=True)
More info on the dataset:
DatasetDict({
    train: Dataset({
        features: ["words", "labels"],
        num_rows: 2
    })
})
On calling the function again, I'm getting the below error:
ArrowInvalid: Column 1 named labels expected length 2 but got length 4
My doubt here is: is this an issue internal to the map function?
How do you map the input ids to word ids? Can you please share that code, or better yet, the full code snippet that does this processing?