Truncation strategy for very long strings

Hello everyone,
I am working on a multi-class text classification model. After experimenting with sklearn algorithms and fastText, I have moved on to BERT, and I want to explore both BERT and ALBERT. Most of the documents I have to classify are longer than 512 tokens after tokenization. I pass each document as a single string, and the second half of the string is relatively more important than the first half. It is not clear to me whether truncation_strategy='only_first' removes tokens from the start of the tokenized string until its length is down to 512, or whether it only applies when I pass a text pair.

Also, this paper suggests head-only, tail-only, and head-plus-tail truncation strategies. If I want to implement these, does truncation_strategy='only_first' correspond to head-only and truncation_strategy='only_second' to tail-only? If so, how do I implement the head-plus-tail strategy?
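
To make the three strategies concrete, here is a minimal sketch in plain Python of what I mean, operating on an already tokenized list. The function name truncate_tokens and the head_len parameter are my own placeholders, not part of any library API. I am assuming max_len=510 to leave room for two special tokens added afterwards, and head_len=128 is just an illustrative split.

```python
def truncate_tokens(tokens, max_len=510, strategy="head+tail", head_len=128):
    """Truncate a token list to at most max_len items.

    Hypothetical helper for illustration only; special tokens such as
    [CLS] and [SEP] are assumed to be added after truncation.
    """
    if len(tokens) <= max_len:
        return list(tokens)  # nothing to cut
    if strategy == "head":
        return list(tokens[:max_len])  # keep the beginning
    if strategy == "tail":
        return list(tokens[-max_len:])  # keep the end
    if strategy == "head+tail":
        head_len = min(head_len, max_len)
        tail_len = max_len - head_len
        # Guard against tail_len == 0, since tokens[-0:] would return everything.
        tail = list(tokens[-tail_len:]) if tail_len > 0 else []
        return list(tokens[:head_len]) + tail
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    doc = [f"tok{i}" for i in range(1000)]  # stand-in for a tokenized document
    kept = truncate_tokens(doc, strategy="head+tail")
    print(len(kept), kept[0], kept[-1])
```

Since the second half of my documents matters more, I would probably use "tail" or a small head_len with "head+tail".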

Regarding stride (int, optional, defaults to 0): "If set to a number along with max_length, the overflowing tokens returned will contain some tokens from the main sequence returned. The value of this argument defines the number of additional tokens."
If I have a 1000-token sequence and set stride to 500, will the overflowing tokens be retained? Could someone please explain the stride option?
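
To show how I currently read that description, here is a minimal sketch in plain Python that treats stride as the number of tokens shared between consecutive windows. This is only my interpretation, not the tokenizer's actual implementation, and sliding_windows is a hypothetical helper name.

```python
def sliding_windows(tokens, max_len, stride):
    """Split tokens into windows of at most max_len items, where each
    window repeats the last `stride` tokens of the previous one.

    Hypothetical helper illustrating one reading of `stride`.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride  # how far the window start moves each time
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


if __name__ == "__main__":
    for window in sliding_windows(list(range(10)), max_len=4, stride=2):
        print(window)
```

Under this reading, a large stride relative to max_length means each new window advances only a little, so a long document produces many heavily overlapping windows.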

Longformer has a token limit of 4096. I want to explore this as well. Feel free to share your experience with long text classification.

Thank you.