Hello Huggingface,
I'm trying to solve a token classification task where the documents are longer than the model's max length.
I modified the tokenize_and_align_labels function from the example token classification notebook. I set the tokenizer option return_overflowing_tokens=True and rewrote the function to map labels for the overflowing tokens:
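A minimal sketch of that change, assuming the "tokens"/"ner_tags" columns from the example notebook and a placeholder checkpoint:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any fast tokenizer works here.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        return_overflowing_tokens=True,  # keep the chunks beyond max length instead of dropping them
    )
    labels = []
    # Each overflow chunk maps back to the example it came from.
    for i, sample_idx in enumerate(tokenized_inputs["overflow_to_sample_mapping"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        source_labels = examples["ner_tags"][sample_idx]
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # special tokens are ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(source_labels[word_idx])
            else:
                label_ids.append(source_labels[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```

Passing a stride argument to the tokenizer here would additionally make consecutive chunks overlap.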
@oliverguhr, were you able to resolve this issue? I'm about to try something similar and was hoping to snatch some existing code. Thanks for anything you can share!
I had the same question and landed on this thread. Since March 2023 there is a new feature that lets you use stride to chunk long text input and connect the results from each chunk:
It is possible to use stride in the TokenClassification pipeline. If the input token length exceeds the specified model_max_length, classification continues onto the next chunk, with the number of overlapping tokens between chunks set by the stride parameter.
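A minimal sketch of what that looks like (the checkpoint name is just a placeholder; as far as I can tell, stride needs a fast tokenizer and an aggregation strategy other than "none"):

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",    # placeholder checkpoint
    aggregation_strategy="simple",  # stride needs an aggregation strategy other than "none"
    stride=128,                     # tokens shared between consecutive chunks
)

# Long inputs are split into overlapping chunks of at most model_max_length
# tokens, and the predictions from the chunks are merged.
long_text = " ".join(["Hugging Face is based in New York City."] * 200)
entities = ner(long_text)
```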
This, however, wouldn't solve the original question, but that was fine for me. I trained with a model_max_length of 256 and also run inference with a model_max_length of 256. During training, inputs are capped at the max length, so for sentences with 257 or more tokens only the first 256 tokens are part of training. Because sentences with 257 or more tokens were very rare in my case, I could ignore them and still achieve the same result. At inference/prediction time, however, stride enables processing all tokens of sentences with more than 256 tokens.
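To check whether ignoring them is safe, you can count how many training sentences would actually be truncated. A quick sketch, assuming dataset and tokenizer are whatever you already use, with "tokens" as the word column:

```python
# How many training sentences exceed the 256-token cap?
n_long = sum(
    len(tokenizer(words, is_split_into_words=True)["input_ids"]) > 256
    for words in dataset["train"]["tokens"]
)
print(f"{n_long} of {len(dataset['train'])} sentences would be truncated at 256 tokens")
```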
If there is a need to train on all sentences including the overflowing tokens, I suggest chunking the training sentences with the same stride value, creating a new Dataset object, and avoiding .map(). I think .map() expects the same number of examples before and after tokenize_and_align_labels at the PyArrow level. This alternative approach has a disadvantage: by not using .map(), tokenization isn't parallelized, so it gets slower. In my experience this alternative may not be needed at all, hopefully as in my case.
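A rough sketch of that alternative, just to show the idea (it assumes the tokenize_and_align_labels function from above and a dataset with the usual columns):

```python
from datasets import Dataset

# Tokenize the whole training split in one batch (fine for small/medium data),
# using the same stride you plan to use at inference time.
encoded = tokenize_and_align_labels(dataset["train"][:])

# Build a fresh Dataset from the flattened chunks instead of using .map(),
# so the row count is free to grow with the number of overflow chunks.
train_chunks = Dataset.from_dict(
    {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
        "labels": encoded["labels"],
    }
)
```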