Sliding Window - Multilabel Classification

I have multiple text fields I’d like to concatenate into one transformer input. However, doing so pushes some records over the token limit (4096).

Are there any examples or documentation that cover how to handle the predictions before evaluation in a multi-label classification setup where the input text is chunked with a sliding window? I can’t seem to find any.
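For concreteness, here is a minimal sketch of the chunking step I have in mind. It is plain Python over already-tokenized ids, and the names `sliding_window_chunks`, `chunk_records`, `max_len`, and `overlap` are my own placeholders, not any library's API. The key piece is the `chunk_to_record` list, which remembers which record each chunk came from. (If I understand correctly, Hugging Face fast tokenizers can produce something similar via `return_overflowing_tokens=True` and `stride`, returning an `overflow_to_sample_mapping`, but the sketch below does not depend on that.)

```python
def sliding_window_chunks(token_ids, max_len, overlap):
    """Split one token-id list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens (assumed 0 <= overlap < max_len).
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += step
    return chunks


def chunk_records(records, max_len, overlap):
    """Chunk every record and keep a chunk -> record index mapping."""
    chunk_ids = []
    chunk_to_record = []
    for record_idx, token_ids in enumerate(records):
        for chunk in sliding_window_chunks(token_ids, max_len, overlap):
            chunk_ids.append(chunk)
            chunk_to_record.append(record_idx)
    return chunk_ids, chunk_to_record


# Toy example: one long record and one short record.
records = [list(range(10)), list(range(3))]
chunk_ids, chunk_to_record = chunk_records(records, max_len=4, overlap=2)
```

With the mapping stored alongside the chunks, the chunks of one record can be scattered across batches and still be regrouped afterwards.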

Because I want to incorporate the entire document and all possible prediction tags in the document from each chunk, I think I would need to somehow link the predictions from all chunks back to the same record before evaluation and backprop, but I see a few issues with this:

-The batch size would need to be large enough to hold all chunks from one record; otherwise the labels couldn’t be re-aggregated.

-I am still “truncating” context, so I could misclassify on a per-chunk basis. For example, I could get an incorrect prediction for a label in chunk 2 if some informing context from chunk 1 falls outside the overlap/stride. Simply taking the union of the per-chunk predictions could therefore be incorrect, because it doesn’t include this “out of stride” context. In other words, by not taking the mean of the logits, I am destroying the cross-chunk context I hope to incorporate. But if I take the mean, I may leave out labels that don’t appear in every chunk.
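To make the trade-off in the two points above concrete, here is a rough sketch of regrouping per-chunk logits by record and pooling them before thresholding. It assumes one logit per label per chunk, a sigmoid output, and a 0.5 threshold; `aggregate_logits` and `predict_labels` are hypothetical helper names, and the logits are made-up numbers. Element-wise max behaves like a soft union (a label survives if any chunk is confident), while the mean can wash out a label that only one chunk supports. Neither recovers context that a chunk never saw.

```python
import math
from collections import defaultdict


def aggregate_logits(chunk_logits, chunk_to_record, how="max"):
    """Pool per-chunk logits into one logit vector per record."""
    grouped = defaultdict(list)
    for logits, record_idx in zip(chunk_logits, chunk_to_record):
        grouped[record_idx].append(logits)

    pooled = {}
    for record_idx, rows in grouped.items():
        columns = list(zip(*rows))  # one tuple of chunk logits per label
        if how == "max":
            pooled[record_idx] = [max(col) for col in columns]
        elif how == "mean":
            pooled[record_idx] = [sum(col) / len(col) for col in columns]
        else:
            raise ValueError("how must be 'max' or 'mean'")
    return pooled


def predict_labels(record_logits, threshold=0.5):
    """Apply a sigmoid and threshold to each pooled logit."""
    return {
        record_idx: [int(1.0 / (1.0 + math.exp(-z)) >= threshold) for z in logits]
        for record_idx, logits in record_logits.items()
    }


# Made-up logits: record 0 has two chunks, record 1 has one chunk.
# Label 0 is strongly supported by only the first chunk of record 0.
chunk_logits = [[4.0, -4.0], [-6.0, -4.0], [2.0, 1.0]]
chunk_to_record = [0, 0, 1]

max_preds = predict_labels(aggregate_logits(chunk_logits, chunk_to_record, "max"))
mean_preds = predict_labels(aggregate_logits(chunk_logits, chunk_to_record, "mean"))
```

Because the chunks are keyed by record index, this regrouping can run over predictions collected from the whole evaluation set, so it does not need all chunks of a record in one batch at evaluation time. Pooling inside the training step, so the loss is computed per record, is a different matter and would still need a record's chunks together in the forward pass.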

How can I combat this? A larger stride? Perhaps a different multifield approach than concatenating all useful fields together into one input? I was thinking maybe the new LongNet might be better suited for this.