run_clm.py: why does the tokenizer phase use so much memory? 288 GB for <2 GB of input data

When I use run_clm.py, the tokenization phase sometimes consumes hundreds of gigabytes of memory. For example, I'm currently trying to train GPT-2 on 1.9 GB of data; the run_clm.py process has been tokenizing for several hours and currently has 288 GB allocated to it, roughly 100 GB of physical RAM plus 180 GB of swap.

Why does tokenizing 1.9 GB of data require memory 150 times the size of the input data?

I get why you'd need a lot of memory to create the original BPE tokenizer, but surely using an existing tokenizer is just a simple remap of short strings to numbers? E.g. "Hello world" → "Hello" (9906), " world" (1917).
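To illustrate what I mean, a quick sanity check with the Transformers API looks something like this (the exact token IDs depend on which tokenizer gets loaded):

```python
# Encoding with an existing tokenizer: a string in, a short list of integers out.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer("Hello world")["input_ids"]
print(ids)                                   # a handful of integer token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # the sub-word pieces, e.g. ['Hello', 'Ġworld']
```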

I also see this behaviour with the LLaMA tokenizer. Beyond a certain input size, memory usage is high enough that the tokenization phase starts to touch swap, and the whole thing slows to a crawl.

(Edit: tokenizing the 1.9 GB dataset was still running when I killed the process at roughly the 6-hour mark. Tokenizing about a quarter of that, 0.5 GB, takes just over 5 minutes.)

Any ideas? Thanks.

Looking at this again, it seems the model is loaded before the input is tokenized, which may explain why memory use explodes.

I guess one workaround would be to run first with a tiny throwaway model, then run a second time with the "actual" model, which would pick up the cached tokenization results from the first run.

Got to be a better way, right?
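One thing I might try instead: pre-tokenize the corpus on its own with the datasets library, without the model in memory, and save the result to disk. A rough sketch (untested; the file path, column name, and worker count are placeholders):

```python
# Sketch: tokenize the corpus in a lean process, streaming results to Arrow on disk.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Loading with the "text" builder puts the raw lines into an Arrow table on disk.
raw = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = raw.map(
    tokenize,
    batched=True,             # many lines per call, so the fast Rust tokenizer does the work
    num_proc=4,               # parallel workers, each writing its own cache shard
    remove_columns=["text"],  # drop the raw text so it isn't duplicated in the output
    writer_batch_size=1000,   # flush to disk frequently to keep per-worker RAM bounded
)

tokenized.save_to_disk("tokenized_train")
```

The batched map writes Arrow shards to disk as it goes, so memory should stay bounded instead of ballooning the way it does inside run_clm.py. (run_clm.py would then need a small tweak to `load_from_disk` the pre-tokenized dataset instead of re-tokenizing, but at least the heavy step runs without the model loaded.)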