Token Classification: How to tokenize and align labels with overflow and stride?

Hello Huggingface,
I'm trying to solve a token classification task where the documents are longer than the model's max length.
I modified the tokenize_and_align_labels function from the example token classification notebook: I set the tokenizer option return_overflowing_tokens=True and rewrote the function to map labels onto the overflowing tokens:

tokenizer_settings = {'is_split_into_words': True, 'return_offsets_mapping': True,
                      'padding': True, 'truncation': True, 'stride': 0,
                      'max_length': tokenizer.model_max_length, 'return_overflowing_tokens': True}

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], **tokenizer_settings)

    labels = []
    for i, document in enumerate(tokenized_inputs.encodings):
        doc_encoded_labels = []
        last_word_id = None
        for word_id in document.word_ids:
            if word_id is None:  # or last_word_id == word_id:
                doc_encoded_labels.append(-100)  # special tokens are ignored by the loss
            else:
                # map this chunk back to the example it overflowed from
                document_id = tokenized_inputs.overflow_to_sample_mapping[i]
                label = examples[task][document_id][word_id]
                doc_encoded_labels.append(int(label))
            last_word_id = word_id
        labels.append(doc_encoded_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Executing this code results in the following error:

exception has occurred: ArrowInvalid
Column 5 named task1 expected length 820 but got length 30

It looks like the 30 input examples can't be mapped to the 820 examples produced by the chunking. How can I solve this issue?

Environment info

Google Colab running this notebook

To reproduce

Steps to reproduce the behaviour:

  1. Replace the tokenize_and_align_labels function with the function given above.
  2. Add examples longer than max_length.
  3. Run the tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True) cell.

cc @sgugger

@oliverguhr, were you able to resolve this issue? I'm about to try something similar and was hoping to snatch some existing code. Thanks for anything you can share!

Well, I avoided the problem: I only tokenize the first 280 words of each document, which leads to sequences of fewer than 512 tokens.
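
Something along these lines, reusing the "tokens" column from the function above (truncate_words and its name are just my own sketch of that workaround):

# Rough sketch of the workaround: keep only the first 280 words of each document,
# so the tokenized sequences stay below the 512-token limit.
def truncate_words(examples, max_words=280):
    examples["tokens"] = [words[:max_words] for words in examples["tokens"]]
    return examples

datasets = datasets.map(truncate_words, batched=True)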

I had the same question, so I landed on this thread. Since March 2023 there is a new feature that lets you use stride for long text input, so the text gets chunked and the results from each chunk are connected:

It is possible to use stride in the TokenClassification pipeline. In case the input token length exceeds the specified model_max_length, this allows token classification to continue on the next chunk, with overlapping tokens between chunks; the number of overlapping tokens is specified by the stride parameter.
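
For reference, a minimal usage sketch of that feature (the checkpoint name and the stride value are placeholders of mine, not something from the thread; as far as I understand, stride needs a fast tokenizer and an aggregation strategy other than "none"):

from transformers import pipeline

# Minimal sketch of the stride feature described above (transformers >= 4.27, March 2023).
# The checkpoint is only a placeholder; any token-classification model should work.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="first",  # stride requires an aggregation strategy other than "none"
    stride=128,                    # number of tokens shared between consecutive chunks
)

long_text = "..."  # a document longer than the model's max length
entities = ner(long_text)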

This, however, wouldn't solve the original question, and that was fine for me. I trained with a model_max_length of 256 and also perform inference with a model_max_length of 256. During training the input is capped at the max length, so for sentences with 257 or more tokens only the first 256 tokens are part of training. Because sentences with 257 or more tokens were very rare in my case, I could ignore them and still achieve the same result. But during inference/prediction, stride enables processing of all tokens for sentences with more than 256 tokens.

If there is a need to train on all sentences with overflowing tokens, I suggest chunking the training sentences with the same stride, creating a new Dataset object, and avoiding .map(). I think .map() expects the same number of rows before and after tokenize_and_align_labels at the pyarrow level. This alternative approach has the disadvantage that, without .map(), tokenization isn't parallelized, so it gets slower. In my finding, this alternative approach may not be needed at all, hopefully like my case.
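
A rough sketch of that alternative, reusing tokenize_and_align_labels from the original post (applying it to the whole split in one call is just for illustration; for a large corpus you would tokenize in batches and concatenate the results):

from datasets import Dataset

# Tokenize the whole training split in one call (illustration only; batch this for large corpora).
tokenized = tokenize_and_align_labels(datasets["train"][:])

# Build a fresh Dataset from the overflow chunks instead of mapping in place,
# so the number of rows is allowed to grow from 30 to 820.
train_chunks = Dataset.from_dict({
    "input_ids": tokenized["input_ids"],
    "attention_mask": tokenized["attention_mask"],
    "labels": tokenized["labels"],
})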