How to tokenize large contexts without running out of memory

I am currently trying to tokenize the SQuAD dataset for fine-tuning the Reformer model.
The problem is that the Reformer model needs a fixed input length, and the available pre-trained checkpoint expects roughly 500,000 tokens. My idea was to fuse together the contexts of the Wiki articles and pad the rest to reach that token threshold. The issue I have run into is that I can only tokenize at most about 500 examples at a time, otherwise I run out of RAM (64 GB). Is there a possibility to use the tokenizer on smaller subsets of the data and then merge the resulting BatchEncodings afterwards?
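Something like this sketch is what I have in mind (the checkpoint name and the chunk size of 500 are just placeholders on my side, not something I've verified):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; the point is only the chunking pattern.
tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")

def tokenize_in_chunks(texts, chunk_size=500):
    """Tokenize the contexts in small chunks and merge the results,
    so only chunk_size examples are handled by the tokenizer at once."""
    all_input_ids, all_attention_masks = [], []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        enc = tokenizer(chunk, add_special_tokens=False)
        all_input_ids.extend(enc["input_ids"])
        all_attention_masks.extend(enc["attention_mask"])
    return all_input_ids, all_attention_masks
```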
Would it be possible to just append the attention masks and the input_ids?
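To be concrete, by "append" I mean roughly the following (MAX_LEN and PAD_ID are assumptions, I haven't checked what the checkpoint actually expects for padding):

```python
# Sketch: fuse the per-context input_ids into one fixed-length example
# and build the attention mask by hand (1 for real tokens, 0 for padding).
MAX_LEN = 524288  # assumed fixed input length of the pre-trained model
PAD_ID = 0        # assumed pad token id

def fuse_and_pad(input_ids_lists):
    fused = [tok for ids in input_ids_lists for tok in ids][:MAX_LEN]
    n_pad = MAX_LEN - len(fused)
    return {
        "input_ids": fused + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(fused) + [0] * n_pad,
    }
```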
