NER Label tokenization with overflowing tokens

I am trying to train BERT for token classification, and I want to use the full text, splitting it into two samples if necessary. For that I am using return_overflowing_tokens with a specific stride. I want to tokenize the labels for this as well. I have seen the tokenize_and_align_labels function in the tutorial (Token classification), but it doesn't take care of such overflow. Is there something already available for this? Or should I generate the samples without truncation and then split them to a specific length later on?

@Sajan Hey, I'm also working on the same issue. I tried the same approach and am getting the same error, where input_ids has a different length due to the sliding-window chunks while the labels keep the same length as the dataset. Please let me know if the issue has been resolved and how you tackled it.

ArrowInvalid: Column 2 named input_ids expected length 2 but got length 4

Hi @Raisa06, the error you are getting (ArrowInvalid: Column 2 named input_ids expected length 2 but got length 4) seems different.

For this issue: the fast BERT tokenizer's output provides a word_ids() method, which gives the original index of each pre-tokenized word and works with return_overflowing_tokens (e.g. tokenized_encoding.word_ids(2) gives the original word indices for the third chunk, which can be mapped to labels). I then loop over the tokenized word IDs and add the corresponding label to the final labels list.
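
For illustration, a minimal sketch of that idea; the checkpoint, example words, and label IDs below are placeholders, not from this thread:

from transformers import AutoTokenizer

# Hypothetical pre-tokenized example with one word-level label per word
words = ["John", "lives", "in", "New", "York", "City"]
word_labels = [1, 0, 0, 2, 2, 2]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer(
    words,
    is_split_into_words=True,
    truncation=True,
    max_length=6,   # deliberately tiny to force an overflow chunk
    stride=2,
    return_overflowing_tokens=True,
)

# word_ids(i) maps each token of chunk i back to its original word index,
# so the word-level label can be copied over; None marks special tokens.
for i in range(len(encoding["input_ids"])):
    chunk_labels = [
        -100 if word_id is None else word_labels[word_id]
        for word_id in encoding.word_ids(i)
    ]
    print(chunk_labels)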

Hi @Sajan, thanks for your reply. I get you. Now I've implemented the sliding windows and mapped labels to each window. For example:

  1. len(examples["words"]) is 2, i.e. two lists of tokens, each longer than the maximum sequence length.

  2. Introduced sliding windows during tokenization:

  3. tokenized_inputs = tokenizer(examples["words"], is_split_into_words=True, truncation=True, padding="max_length", max_length=500, stride=200, return_overflowing_tokens=True, return_offsets_mapping=True)

  4. so tokenized_inputs["input_ids"] will have, say, 4 lists according to the sliding-window configuration I've set,

  5. and I've mapped input_ids to word_ids and then to labels manually (Hugging Face doesn't have a function to align labels across the sliding windows it generates; see the sketch after this list),

  6. which results in tokenized_inputs["labels"] having 4 lists.
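
To make steps 2 to 6 concrete, here is a sketch of what the whole function could look like, labeling only the first sub-token of each word as the tutorial does; the column names match the dataset below, the rest is my assumption:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["words"],
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=500,
        stride=200,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    labels = []
    # overflow_to_sample_mapping says which original row each chunk came from
    for i, sample_idx in enumerate(tokenized_inputs["overflow_to_sample_mapping"]):
        word_labels = examples["labels"][sample_idx]
        previous_word_id = None
        chunk_labels = []
        for word_id in tokenized_inputs.word_ids(i):
            if word_id is None:
                chunk_labels.append(-100)        # special tokens / padding
            elif word_id != previous_word_id:
                chunk_labels.append(word_labels[word_id])
            else:
                chunk_labels.append(-100)        # later sub-tokens of the same word
            previous_word_id = word_id
        labels.append(chunk_labels)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs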

tokenized_dataset = datasets.map(tokenize_and_align_labels, batched=True)

More info on the dataset:

DatasetDict({
    train: Dataset({
        features: ['words', 'labels'],
        num_rows: 2
    })
})

On calling the function again, I'm getting the error below:

ArrowInvalid: Column 1 named labels expected length 2 but got length 4

My doubt here is: is this an issue at the internal function level?
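
Not an internal bug, as far as I can tell: with batched=True, map is allowed to return more rows than it received (here, 4 chunks from 2 examples), but then every returned column must have the new length, while the untouched original columns (words and labels) still have length 2, which is exactly what the ArrowInvalid message complains about. A common workaround is to drop the original columns, for example:

tokenized_dataset = datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=datasets["train"].column_names,  # drop the old-length columns
)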

How do you map the input IDs to word IDs? Can you please share that code, or better yet, the full code snippet that does this processing?