I need to process very long documents on TPUs, and I want to be able to use pre-trained models other than Longformer. There are multiple approaches to this, but I am interested in the sliding window approach. I can already tokenize each document into overlapping sub-documents using the stride option together with return_overflowing_tokens=True, but I have no idea how to incorporate them into training. I don’t want each sub-document to be treated as a separate document. Instead, the activations of all sub-documents that belong to the same document should first be combined (by taking the mean, for instance), and only then should the softmax be applied and the loss calculated, once per document.
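To make the combination step concrete, here is a minimal, framework-free sketch of what I have in mind. It is not a working Trainer integration. The function names, the toy logits, and the labels are made up for illustration; chunk_to_doc plays the role of the overflow_to_sample_mapping that a fast tokenizer returns when return_overflowing_tokens=True is set. It assumes every document contributes at least one sub-document.

```python
import math


def mean_pool_by_document(chunk_logits, chunk_to_doc, num_docs):
    """Average the per-chunk logits of all chunks that share a document index."""
    num_classes = len(chunk_logits[0])
    sums = [[0.0] * num_classes for _ in range(num_docs)]
    counts = [0] * num_docs
    for logits, doc in zip(chunk_logits, chunk_to_doc):
        counts[doc] += 1
        for c, value in enumerate(logits):
            sums[doc][c] += value
    # Assumption: every document has at least one chunk, so counts[d] > 0.
    return [[s / counts[d] for s in sums[d]] for d in range(num_docs)]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one document (max subtracted for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


# Toy data (hypothetical): three sub-documents, two classes.
# The first two sub-documents come from document 0, the third from document 1.
chunk_logits = [[2.0, 0.0], [0.0, 0.0], [1.0, 3.0]]
chunk_to_doc = [0, 0, 1]
labels = [0, 1]  # one label per document, not per sub-document

# Combine first, then apply softmax and loss once per document.
doc_logits = mean_pool_by_document(chunk_logits, chunk_to_doc, num_docs=len(labels))
loss = sum(cross_entropy(l, y) for l, y in zip(doc_logits, labels)) / len(labels)
```

In an actual model, the same grouping would have to be done on tensors inside the forward pass or a custom loss, so that gradients flow through the pooled logits back into every sub-document.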
Is there a way to achieve this using the HuggingFace Trainer / TrainingArguments? Is there any working example that does the same thing with the transformers library?