NER Label tokenization with overflowing tokens

I am trying to train BERT for token classification, and I want to use the full text, splitting it into two samples if necessary. For that I am using return_overflowing_tokens with a specific stride. I want to tokenize the labels for this as well. I have seen the tokenize_and_align_labels function in the tutorial (Token classification), but it doesn't take care of such overflow. Is there something already available for this? Or should I generate the samples without truncation and then split them to a specific length later on?

@Sajan Hey, I'm also working on the same issue. I tried the same approach and am getting the same error, where input_ids has a different length due to the sliding-window chunks while the labels keep the same length as the dataset. Please let me know if the issue has been resolved and how you tackled it.

ArrowInvalid: Column 2 named input_ids expected length 2 but got length 4

Hi @Raisa06, the error you are getting (ArrowInvalid: Column 2 named input_ids expected length 2 but got length 4) seems different.

For this issue: the fast BERT tokenizer's output provides a word_ids() method, which gives the original index of each pre-tokenized word and works with return_overflowing_tokens (e.g. tokenized_encoding.word_ids(2) gives the original word indices for the third chunk, which can be mapped to labels). I then loop over the tokenized word IDs and add the corresponding label to the final labels list.
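
For illustration, a minimal sketch of that idea; the checkpoint, example words, and label IDs below are placeholders, not from this thread:

from transformers import AutoTokenizer

# Hypothetical pre-tokenized example with one word-level label per word
words = ["John", "lives", "in", "New", "York", "City"]
word_labels = [1, 0, 0, 2, 2, 2]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer(
    words,
    is_split_into_words=True,
    truncation=True,
    max_length=6,   # deliberately tiny to force an overflow chunk
    stride=2,
    return_overflowing_tokens=True,
)

# word_ids(i) maps each token of chunk i back to its original word index,
# so the word-level label can be copied over; None marks special tokens.
for i in range(len(encoding["input_ids"])):
    chunk_labels = [
        -100 if word_id is None else word_labels[word_id]
        for word_id in encoding.word_ids(i)
    ]
    print(chunk_labels)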

Hi @Sajan, thanks for your reply. I get you. Now I've implemented the sliding windows and mapped labels to each window. For example:

  1. len(examples["words"]) is 2, i.e. two lists of tokens, each longer than the maximum sequence length.

  2. Introduced sliding windows during tokenization:

  3. tokenized_inputs = tokenizer(examples["words"], is_split_into_words=True, truncation=True, padding="max_length", max_length=500, stride=200, return_overflowing_tokens=True, return_offsets_mapping=True)

  4. so tokenized_inputs["input_ids"] will have, say, 4 lists according to the sliding-window configuration I've set,

  5. and I've mapped input_ids to word_ids and then to labels manually (Hugging Face doesn't have a function to align labels across the sliding windows it generates; see the sketch after this list),

  6. which results in tokenized_inputs["labels"] having 4 lists.
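
To make steps 2 to 6 concrete, here is a sketch of what the whole function could look like, labeling only the first sub-token of each word as the tutorial does; the column names match the dataset below, the rest is my assumption:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["words"],
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=500,
        stride=200,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    labels = []
    # overflow_to_sample_mapping says which original row each chunk came from
    for i, sample_idx in enumerate(tokenized_inputs["overflow_to_sample_mapping"]):
        word_labels = examples["labels"][sample_idx]
        previous_word_id = None
        chunk_labels = []
        for word_id in tokenized_inputs.word_ids(i):
            if word_id is None:
                chunk_labels.append(-100)        # special tokens / padding
            elif word_id != previous_word_id:
                chunk_labels.append(word_labels[word_id])
            else:
                chunk_labels.append(-100)        # later sub-tokens of the same word
            previous_word_id = word_id
        labels.append(chunk_labels)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs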

tokenized_dataset = datasets.map(tokenize_and_align_labels, batched=True)

More info on the dataset:

DatasetDict({
    train: Dataset({
        features: ['words', 'labels'],
        num_rows: 2
    })
})

On calling the function again, I'm getting the error below:

ArrowInvalid: Column 1 named labels expected length 2 but got length 4

My doubt here is: is this an issue at the internal function level?
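
Not an internal bug, as far as I can tell: with batched=True, map is allowed to return more rows than it received (here, 4 chunks from 2 examples), but then every returned column must have the new length, while the untouched original columns (words and labels) still have length 2, which is exactly what the ArrowInvalid message complains about. A common workaround is to drop the original columns, for example:

tokenized_dataset = datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=datasets["train"].column_names,  # drop the old-length columns
)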

How do you map the input IDs to word IDs? Can you please share that code, or better yet, the full code snippet that does this processing?