XLNet recurrence mechanism on long sequences


Can anyone tell me what happens when we feed an input sequence of more than 512 tokens to an XLNet model?
My impression is that the model is currently applied to the whole sequence as is (without any chunking)…

From the paper, I understood that the implementation will “internally” chunk the sequence and apply the model on each chunk (and re-use the cached hidden states of the previous chunk).

Is this actually the case? Or do I need to chunk the sequence beforehand (myself) and feed each chunk to the model along with the cached hidden states from the previous chunk (mems)?
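For context, the manual-chunking variant I have in mind would look roughly like this. This is only a sketch of what I mean, using a tiny randomly-initialized `XLNetConfig` for illustration (the chunk size of 512 and the `mem_len` value are arbitrary choices here, not anything prescribed by the paper):

```python
import torch
from transformers import XLNetConfig, XLNetModel

# Tiny config purely for illustration; a real use case would load
# pretrained weights, e.g. XLNetModel.from_pretrained("xlnet-base-cased").
config = XLNetConfig(
    vocab_size=32, d_model=16, n_layer=2, n_head=2, d_inner=32,
    mem_len=512,  # how many cached hidden states to keep per layer
)
model = XLNetModel(config).eval()

# A "long" sequence: batch of 1, 1536 tokens = 3 chunks of 512.
long_input = torch.randint(0, config.vocab_size, (1, 1536))

mems = None
chunk_outputs = []
with torch.no_grad():
    for chunk in long_input.split(512, dim=1):
        out = model(input_ids=chunk, mems=mems, use_mems=True)
        mems = out.mems  # cached hidden states, fed to the next chunk
        chunk_outputs.append(out.last_hidden_state)

hidden = torch.cat(chunk_outputs, dim=1)
print(hidden.shape)  # (1, 1536, 16)
```

So each chunk attends to the previous chunk's cached states via `mems`, rather than the whole sequence being processed in one attention pass.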

When I check the implementation, I see that the attention layer is applied directly to the whole input sequence. Only the feed-forward layer is applied to each chunk separately, but it seems that there is only one chunk even for long sequences with the default settings.

Thanks!