run_clm.py: why does the tokenizer phase use so much memory? 288 GB for <2 GB of input data

When I use run_clm.py, the tokenization phase sometimes consumes hundreds of gigabytes of memory. For example, I'm currently trying to train GPT-2 on 1.9 GB of data; the run_clm.py process has been tokenizing for several hours and currently has 288 GB allocated to it, roughly 100 GB of physical RAM plus 180 GB of swap.

Why does tokenizing 1.9 GB of data require memory 150 times the size of the input data?

I get why you'd need a lot of memory to create the original BPE tokenizer, but surely using an existing tokenizer is just a simple remap of short strings to numbers? E.g. "Hello world" → "Hello" (9906), " world" (1917).
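To illustrate what I mean, a quick sanity check with the Transformers API looks something like this (the exact token IDs depend on which tokenizer gets loaded):

```python
# Encoding with an existing tokenizer: a string in, a short list of integers out.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer("Hello world")["input_ids"]
print(ids)                                   # a handful of integer token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # the sub-word pieces, e.g. ['Hello', 'Ġworld']
```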

I also see this behaviour with the LLaMA tokenizer. Beyond a certain input size, memory usage is high enough that the tokenization phase starts to touch swap, and the whole thing slows to a crawl.

(Edit: tokenizing the 1.9 GB dataset was still running when I killed the process at roughly the 6-hour mark. Tokenizing about a quarter of that, 0.5 GB, takes just over 5 minutes.)

Any ideas? Thanks.

Looking at this again, it seems the model is loaded before the input is tokenized, which may explain why memory use explodes.

I guess one workaround would be to run first with a tiny throwaway model, then run a second time with the "actual" model, which would pick up the cached tokenization results from the first run.

Got to be a better way, right?
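One thing I might try instead: pre-tokenize the corpus on its own with the datasets library, without the model in memory, and save the result to disk. A rough sketch (untested; the file path, column name, and worker count are placeholders):

```python
# Sketch: tokenize the corpus in a lean process, streaming results to Arrow on disk.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Loading with the "text" builder puts the raw lines into an Arrow table on disk.
raw = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = raw.map(
    tokenize,
    batched=True,             # many lines per call, so the fast Rust tokenizer does the work
    num_proc=4,               # parallel workers, each writing its own cache shard
    remove_columns=["text"],  # drop the raw text so it isn't duplicated in the output
    writer_batch_size=1000,   # flush to disk frequently to keep per-worker RAM bounded
)

tokenized.save_to_disk("tokenized_train")
```

The batched map writes Arrow shards to disk as it goes, so memory should stay bounded instead of ballooning the way it does inside run_clm.py. (run_clm.py would then need a small tweak to `load_from_disk` the pre-tokenized dataset instead of re-tokenizing, but at least the heavy step runs without the model loaded.)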