Token Classification: How to tokenize and align labels with overflow and stride?

Hello Huggingface,
I'm trying to solve a token classification task where the documents are longer than the model's max length.
I modified the tokenize_and_align_labels function from the example token classification notebook: I set the tokenizer option return_overflowing_tokens=True and rewrote the function to map labels onto the overflowing tokens:

tokenizer_settings = {'is_split_into_words': True, 'return_offsets_mapping': True,
                      'padding': True, 'truncation': True, 'stride': 0,
                      'max_length': tokenizer.model_max_length, 'return_overflowing_tokens': True}

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], **tokenizer_settings)

    labels = []
    for i, document in enumerate(tokenized_inputs.encodings):
        doc_encoded_labels = []
        last_word_id = None
        for word_id in document.word_ids:
            if word_id is None:  # or last_word_id == word_id:
                doc_encoded_labels.append(-100)  # special tokens are ignored by the loss
            else:
                # map this chunk back to the example it overflowed from
                document_id = tokenized_inputs.overflow_to_sample_mapping[i]
                label = examples[task][document_id][word_id]
                doc_encoded_labels.append(int(label))
            last_word_id = word_id
        labels.append(doc_encoded_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Executing this code results in the following error:

exception has occurred: ArrowInvalid
Column 5 named task1 expected length 820 but got length 30

It looks like the 30 input examples can't be mapped to the 820 examples produced by the chunking. How can I solve this issue?

Environment info

Google Colab running this notebook

To reproduce

Steps to reproduce the behaviour:

  1. Replace the tokenize_and_align_labels function with the function given above.
  2. Add examples longer than max_length.
  3. Run the tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True) cell.

cc @sgugger

@oliverguhr, were you able to resolve this issue? I'm about to try something similar and was hoping to snatch some existing code. Thanks for anything you can share!

Well, I avoided the problem: I only tokenize the first 280 words of each document, which leads to sequences of fewer than 512 tokens.
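
Something along these lines, reusing the "tokens" column from the function above (truncate_words and its name are just my own sketch of that workaround):

# Rough sketch of the workaround: keep only the first 280 words of each document,
# so the tokenized sequences stay below the 512-token limit.
def truncate_words(examples, max_words=280):
    examples["tokens"] = [words[:max_words] for words in examples["tokens"]]
    return examples

datasets = datasets.map(truncate_words, batched=True)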

I had the same question, so I landed on this thread. Since March 2023 there is a new feature that lets you use stride for long text input, so the text gets chunked and the results from each chunk are connected:

It is possible to use stride in the TokenClassification pipeline. In case the input token length exceeds the specified model_max_length, this allows token classification to continue on the next chunk, with overlapping tokens between chunks; the number of overlapping tokens is specified by the stride parameter.
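
For reference, a minimal usage sketch of that feature (the checkpoint name and the stride value are placeholders of mine, not something from the thread; as far as I understand, stride needs a fast tokenizer and an aggregation strategy other than "none"):

from transformers import pipeline

# Minimal sketch of the stride feature described above (transformers >= 4.27, March 2023).
# The checkpoint is only a placeholder; any token-classification model should work.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="first",  # stride requires an aggregation strategy other than "none"
    stride=128,                    # number of tokens shared between consecutive chunks
)

long_text = "..."  # a document longer than the model's max length
entities = ner(long_text)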

This, however, wouldn't solve the original question, and that was fine for me. I trained with a model_max_length of 256 and also perform inference with a model_max_length of 256. During training the input is capped at the max length, so for sentences with 257 or more tokens only the first 256 tokens are part of training. Because sentences with 257 or more tokens were very rare in my case, I could ignore them and still achieve the same result. But during inference/prediction, stride enables processing of all tokens for sentences with more than 256 tokens.

If there is a need to train on all sentences with overflowing tokens, I suggest chunking the training sentences with the same stride, creating a new Dataset object, and avoiding .map(). I think .map() expects the same number of rows before and after tokenize_and_align_labels at the pyarrow level. This alternative approach has the disadvantage that, without .map(), tokenization isn't parallelized, so it gets slower. In my finding, this alternative approach may not be needed at all, hopefully like my case.
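
A rough sketch of that alternative, reusing tokenize_and_align_labels from the original post (applying it to the whole split in one call is just for illustration; for a large corpus you would tokenize in batches and concatenate the results):

from datasets import Dataset

# Tokenize the whole training split in one call (illustration only; batch this for large corpora).
tokenized = tokenize_and_align_labels(datasets["train"][:])

# Build a fresh Dataset from the overflow chunks instead of mapping in place,
# so the number of rows is allowed to grow from 30 to 820.
train_chunks = Dataset.from_dict({
    "input_ids": tokenized["input_ids"],
    "attention_mask": tokenized["attention_mask"],
    "labels": tokenized["labels"],
})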