My opinion is that you should break long documents into smaller chunks rather than truncating them.
One of the advantages of RoBERTa over BERT is that it was trained on more data. If you throw away everything after the first 512 tokens of each document, you lose that “more data” advantage.
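For example, here is a minimal sketch of what I mean, assuming a Hugging Face tokenizer (the function name `chunk_document` is mine, not anything standard):

```python
from transformers import RobertaTokenizerFast

# Any RoBERTa-style tokenizer would do; "roberta-base" is just an example.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def chunk_document(text, max_tokens=512):
    """Split one document into consecutive chunks of at most max_tokens,
    instead of keeping only the first 512 tokens and discarding the rest.
    (In practice you would reserve a couple of positions for the <s> and
    </s> special tokens, e.g. max_tokens=510.)"""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
```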
Liu et al. trained English RoBERTa using the DOC-SENTENCES or FULL-SENTENCES regimes, either of which uses most of the words in each document, not just the first 512 tokens. From the paper:
FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from one or more documents […] inputs may cross document boundaries.
DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries.
I am not an expert, but I’m pretty sure that is correct.
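If it helps, here is a rough sketch of DOC-SENTENCES-style packing as I understand it (my reading of the paper, not their code; it assumes the tokenizer from above and that you have already split each document into sentences, e.g. with nltk’s sent_tokenize):

```python
def pack_doc_sentences(sentences, tokenizer, max_tokens=512):
    """Fill each input with whole sentences from a single document until
    the next sentence would overflow max_tokens. Call this once per
    document so inputs never cross document boundaries."""
    inputs, current = [], []
    for sent in sentences:
        # Truncate a single over-long sentence so it can still fit.
        ids = tokenizer(sent, add_special_tokens=False)["input_ids"][:max_tokens]
        if current and len(current) + len(ids) > max_tokens:
            inputs.append(current)
            current = []
        current += ids
    if current:
        inputs.append(current)
    return inputs
```

FULL-SENTENCES would differ only in carrying `current` over from one document into the next, with a separator token in between.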
The next two ideas are only speculation:
It might be even better if you could align the start of each chunk with the start of a sentence, as the packing sketch above does (but I don’t actually know whether that would make any difference).
Liu et al. used 160GB of data. Since you have only 60GB, you might consider sampling your data several times with the splits in different positions. Maybe you could even wrap each document into itself (i.e. once you reach the end of the document, if you haven’t reached a 512-token boundary, start again from its beginning). Both ideas are sketched below.
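Again, this is pure speculation, but it could look something like this (the function name and the `offset` trick are mine):

```python
def wrap_chunks(ids, chunk_len=512, offset=0):
    """Speculative: chunk one document's token ids, starting at `offset`
    (vary the offset between passes over the corpus so the split points
    land in different places), and wrap the final short chunk around to
    the beginning of the same document."""
    if not ids:
        return []
    start = offset % len(ids)
    chunks = [ids[i:i + chunk_len] for i in range(start, len(ids), chunk_len)]
    last = chunks[-1]
    while len(last) < chunk_len:
        # Cycle back to the document's own start until the chunk is full.
        last.extend(ids[:chunk_len - len(last)])
    return chunks
```

Note that short documents will repeat themselves inside a chunk this way; whether that helps or hurts, I can’t say.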