How to tokenize large contexts without running out of memory

I am currently trying to tokenize the SQuAD dataset for fine-tuning the Reformer model.
The problem is that the Reformer model needs a fixed input length, and the available pre-trained one requires roughly 500,000 tokens. My idea was to fuse the contexts of the Wikipedia articles together and pad the rest to reach that token threshold. The issue I have run into is that I can only tokenize at most 500 examples at a time, or else I run out of RAM (64 GB). Is there a way to run the tokenizer on smaller subsets of the data and then merge the resulting BatchEncodings afterwards?
Would it be possible to just append the attention masks and the input_ids?
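To make the idea concrete, here is a minimal sketch of what I have in mind, assuming the `google/reformer-crime-and-punishment` tokenizer and placeholder values for the chunk size and target length:

```python
# A minimal sketch of the chunk-and-append idea, assuming the
# google/reformer-crime-and-punishment tokenizer; chunk_size and max_len
# are placeholder values, not tested settings.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")

# Stand-in for the real SQuAD contexts loaded elsewhere.
contexts = ["First Wikipedia context ...", "Second Wikipedia context ..."]

chunk_size = 500      # as many examples as fit into RAM at once
max_len = 524288      # fixed Reformer input length (placeholder)

input_ids, attention_mask = [], []
for start in range(0, len(contexts), chunk_size):
    chunk = contexts[start:start + chunk_size]
    # Tokenize without padding so the per-chunk results can simply be appended.
    enc = tokenizer(chunk, add_special_tokens=False, padding=False)
    for ids in enc["input_ids"]:
        input_ids.extend(ids)
        attention_mask.extend([1] * len(ids))

# Pad (or truncate) the fused sequence to the fixed model length.
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
input_ids = input_ids[:max_len] + [pad_id] * max(0, max_len - len(input_ids))
attention_mask = attention_mask[:max_len] + [0] * max(0, max_len - len(attention_mask))
```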


Hi Samuel,

I am facing a similar issue: I am running out of memory while attempting to tokenize 28M datapoints with 64 GB of RAM. Were you able to solve this problem? If yes, could you please let me know how?

Thank you,
Pradyuman

Hi, sadly I wasn't able to do it, but I managed to train a Reformer model on a smaller input with a different tokenizer.
If you are only slightly short on RAM and are using a Linux-based system, you could create a large swap partition to effectively double your RAM; however, this will slow down processing considerably, depending on how fast your disk/SSD is.
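For anyone who still needs to tokenize the full dataset without holding it all in memory, one alternative I did not try is the `datasets` library's batched `map()`, which writes each processed batch to an on-disk Arrow cache instead of keeping everything in RAM. A rough sketch, with a placeholder batch size and the same tokenizer assumption as above:

```python
# Rough sketch: tokenize SQuAD batch by batch; datasets caches results on disk,
# so only batch_size examples are held in memory at a time.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")
squad = load_dataset("squad")

def tokenize(batch):
    # No padding here; fusing/padding to the fixed Reformer length can be a later step.
    return tokenizer(batch["context"], add_special_tokens=False, padding=False)

tokenized = squad.map(tokenize, batched=True, batch_size=500)
```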