How to tokenize large contexts without running out of memory

I am currently trying to tokenize the SQuAD dataset for fine-tuning the Reformer model.
The problem is that the Reformer model needs a fixed input length, and the available pre-trained one requires roughly 500,000 tokens. My idea was to fuse the contexts of the Wikipedia articles together and pad the rest to reach that token threshold. The issue I have run into is that I can only tokenize at most 500 examples at a time, or else I run out of RAM (64 GB). Is there a way to run the tokenizer on smaller subsets of the data and then merge the resulting BatchEncodings afterwards?
Would it be possible to just append the attention masks and the input_ids?
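To make the idea concrete, here is a minimal sketch of what I have in mind, assuming the `google/reformer-crime-and-punishment` tokenizer and placeholder values for the chunk size and target length:

```python
# A minimal sketch of the chunk-and-append idea, assuming the
# google/reformer-crime-and-punishment tokenizer; chunk_size and max_len
# are placeholder values, not tested settings.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")

# Stand-in for the real SQuAD contexts loaded elsewhere.
contexts = ["First Wikipedia context ...", "Second Wikipedia context ..."]

chunk_size = 500      # as many examples as fit into RAM at once
max_len = 524288      # fixed Reformer input length (placeholder)

input_ids, attention_mask = [], []
for start in range(0, len(contexts), chunk_size):
    chunk = contexts[start:start + chunk_size]
    # Tokenize without padding so the per-chunk results can simply be appended.
    enc = tokenizer(chunk, add_special_tokens=False, padding=False)
    for ids in enc["input_ids"]:
        input_ids.extend(ids)
        attention_mask.extend([1] * len(ids))

# Pad (or truncate) the fused sequence to the fixed model length.
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
input_ids = input_ids[:max_len] + [pad_id] * max(0, max_len - len(input_ids))
attention_mask = attention_mask[:max_len] + [0] * max(0, max_len - len(attention_mask))
```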


Hi Samuel,

I am facing a similar issue: I am running out of memory while attempting to tokenize 28M datapoints with 64 GB of RAM. Were you able to solve this problem? If yes, could you please let me know how?

Thank you,
Pradyuman

Hi, sadly I wasn't able to do it, but I managed to train a Reformer model on a smaller input with a different tokenizer.
If you are only slightly short on RAM and are using a Linux-based system, you could create a large swap partition to effectively double your RAM; however, this will slow down processing considerably, depending on how fast your disk/SSD is.
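For anyone who still needs to tokenize the full dataset without holding it all in memory, one alternative I did not try is the `datasets` library's batched `map()`, which writes each processed batch to an on-disk Arrow cache instead of keeping everything in RAM. A rough sketch, with a placeholder batch size and the same tokenizer assumption as above:

```python
# Rough sketch: tokenize SQuAD batch by batch; datasets caches results on disk,
# so only batch_size examples are held in memory at a time.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")
squad = load_dataset("squad")

def tokenize(batch):
    # No padding here; fusing/padding to the fixed Reformer length can be a later step.
    return tokenizer(batch["context"], add_special_tokens=False, padding=False)

tokenized = squad.map(tokenize, batched=True, batch_size=500)
```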