Sequence Classification Long Documents

sallu · March 31, 2021, 11:43am

Hi.
I need to process very long documents on TPUs and be able to use any pre-trained model except Longformer. There are multiple approaches to this, but I am interested in the sliding window approach. I am able to tokenize the documents into sub-documents using the stride option and return_overflowing_tokens=True, but I have no clue how to incorporate them into training. I don’t want each sub-document to behave as a separate document. Rather, the activations of the sub-documents should be combined first by some approach (mean, for instance), and then the softmax should be applied and the loss calculated.
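To make the intended computation concrete, here is a minimal pure-Python sketch of the pooling step described above. It is not a working Trainer integration. It assumes that per-chunk logits have already been produced by some classifier, and that each chunk carries the index of its source document, as fast tokenizers report under `overflow_to_sample_mapping` when `return_overflowing_tokens=True`. The function names (`mean_pool_logits`, `document_loss`) and the toy numbers are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def mean_pool_logits(chunk_logits, sample_mapping):
    """Average per-chunk logits into one logit vector per document.

    chunk_logits: list of per-chunk logit lists, one per sub-document.
    sample_mapping: for each chunk, the index of the document it came from.
    """
    grouped = defaultdict(list)
    for logits, doc_id in zip(chunk_logits, sample_mapping):
        grouped[doc_id].append(logits)
    # Column-wise mean over all chunks that belong to the same document.
    return {
        doc_id: [sum(col) / len(rows) for col in zip(*rows)]
        for doc_id, rows in grouped.items()
    }


def softmax(logits):
    """Numerically stable softmax over a single logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def document_loss(chunk_logits, sample_mapping, labels):
    """Mean cross-entropy over documents, computed on the pooled logits."""
    pooled = mean_pool_logits(chunk_logits, sample_mapping)
    losses = [-math.log(softmax(pooled[d])[labels[d]]) for d in sorted(pooled)]
    return sum(losses) / len(losses)


# Toy data (made up): document 0 was split into two chunks, document 1 into one.
chunk_logits = [[2.0, 0.0], [0.0, 0.0], [1.0, 3.0]]
sample_mapping = [0, 0, 1]
labels = {0: 0, 1: 1}

pooled = mean_pool_logits(chunk_logits, sample_mapping)
loss = document_loss(chunk_logits, sample_mapping, labels)
```

In actual training, the same grouping would presumably be done with tensor operations so that gradients flow through the pooling, for example inside a custom model's forward pass or an overridden `Trainer.compute_loss`. That wiring is not shown here and would need to be adapted to the model and batching scheme in use.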

Is there a way to achieve this using the HuggingFace Trainer / TrainingArguments? Is there any working example that does the same thing using the transformers library?

Thanks

davibarreira · June 9, 2022, 3:41pm

Hey, @sallu, have you figured it out? I have the same issue.
