Token Classification: How to tokenize and align labels with overflow and stride?

Hello Huggingface,
I am trying to solve a token classification task where the documents are longer than the model’s max length.
I modified the tokenize_and_align_labels function from the example token classification notebook: I set the tokenizer option return_overflowing_tokens=True and rewrote the function to map the labels onto the overflowing tokens:

tokenizer_settings = {'is_split_into_words': True, 'return_offsets_mapping': True,
                      'padding': True, 'truncation': True, 'stride': 0,
                      'max_length': tokenizer.model_max_length, 'return_overflowing_tokens': True}
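For reference, return_overflowing_tokens=True splits a document that exceeds max_length into several chunks, and overflow_to_sample_mapping records which input document each chunk came from; that is what the function below relies on. A minimal, self-contained sketch (bert-base-cased is just an arbitrary fast tokenizer for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["word"] * 600  # one pre-split document longer than max_length
enc = tokenizer([words], is_split_into_words=True, truncation=True, stride=0,
                max_length=tokenizer.model_max_length, return_overflowing_tokens=True)

print(len(enc.encodings))                 # 2 -> the document was split into two chunks
print(enc["overflow_to_sample_mapping"])  # [0, 0] -> both chunks come from input document 0
print(enc.encodings[0].word_ids[:5])      # [None, 0, 1, 2, 3] -> None marks the [CLS] token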

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], **tokenizer_settings)

    labels = []
    for i, document in enumerate(tokenized_inputs.encodings):
        doc_encoded_labels = []
        last_word_id = None
        for word_id in document.word_ids:
            if word_id is None:  # or last_word_id == word_id:
                # special tokens get the ignore index so the loss skips them
                doc_encoded_labels.append(-100)
            else:
                # map this chunk back to its source document, then look up the word's label
                # (`task` is the name of the label column, defined elsewhere in the notebook)
                document_id = tokenized_inputs.overflow_to_sample_mapping[i]
                label = examples[task][document_id][word_id]
                doc_encoded_labels.append(int(label))
            last_word_id = word_id
        labels.append(doc_encoded_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Executing this code results in the following error:

exception has occurred: ArrowInvalid
Column 5 named task1 expected length 820 but got length 30

It looks like the 30 input examples can’t be mapped onto the 820 examples produced by the slicing. How can I solve this issue?
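One idea I have not tried yet: since the function returns 820 rows while the untouched original columns still have 30, perhaps dropping the original columns in the map call avoids the length mismatch, something along these lines:

tokenized_datasets = datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=datasets["train"].column_names,  # drop the 30-row columns, keep only the new 820-row ones
)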

Environment info

Google Colab running this notebook

To reproduce

Steps to reproduce the behaviour:

  1. Replace the tokenize_and_align_labels function with the function given above.
  2. Add examples that are longer than max_length (a minimal sketch of such a setup follows below).
  3. Run the tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True) cell.
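A minimal sketch of steps 2 and 3, assuming the tokenizer, tokenizer_settings, and tokenize_and_align_labels from above and a hypothetical in-memory dataset whose column names mirror the notebook’s ("tokens" for the words, "task1" for the labels):

from datasets import Dataset, DatasetDict

task = "task1"  # label column name used inside tokenize_and_align_labels

# two documents: one long enough to overflow max_length, one short
datasets = DatasetDict({"train": Dataset.from_dict({
    "tokens": [["word"] * 600, ["word"] * 10],
    "task1":  [[0] * 600,      [0] * 10],
})})

# the long document is split into two chunks, so the map function returns
# 3 rows for 2 input rows and Arrow raises the length-mismatch error
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)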

cc @sgugger

@oliverguhr, were you able to resolve this issue? I’m about to try something similar and was hoping to snatch some existing code. Thanks for anything you can share!

Well, I avoided the problem: I only tokenize 280 words, which leads to sequences of fewer than 512 tokens.
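Something along these lines keeps each piece under the limit (a rough sketch; the helper name and whether you keep the remaining words as extra windows or simply truncate are up to you):

def split_into_windows(words, labels, window_size=280):
    # cut one long document into fixed-size word windows so each window
    # stays below the 512 token limit after tokenization
    chunks = []
    for start in range(0, len(words), window_size):
        chunks.append((words[start:start + window_size],
                       labels[start:start + window_size]))
    return chunks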