why does the tokenizer phase use so much memory? 288GB for <2GB input data

The tokenizer phase will sometimes consume hundreds of gigabytes of memory. For example, I'm currently trying to train on 1.9 GB of data using GPT-2, and the process has been tokenizing for several hours; it currently has 288 GB allocated to it, and has consumed about 100 GB of physical RAM along with 180 GB of swap.

Why does tokenizing 1.9 GB of data require roughly 150 times as much memory as the input itself?

I get why you'd need a lot of memory to *create* the original BPE tokenizer, but surely *using* an existing tokenizer is just a simple remap of short strings to numbers? e.g. "Hello world" → "Hello" (9906), " world" (1917).
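To illustrate what I mean by "a simple remap": applying an existing vocabulary can be done with a greedy longest-match lookup that needs memory proportional only to the input. (Toy sketch with a made-up vocab; real BPE applies merge rules rather than pure longest-match, and these token ids are just the ones from my example above, not real GPT-2 ids.)

```python
# Hypothetical toy vocabulary -- NOT the real GPT-2 merge tables.
TOY_VOCAB = {"Hello": 9906, " world": 1917, " ": 220, "w": 86, "o": 78}

def encode(text, vocab):
    """Greedy longest-prefix match against a fixed vocab, O(n) extra memory."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(encode("Hello world", TOY_VOCAB))  # → [9906, 1917]
```

Nothing in this lookup should need more than the input text plus the output id list, which is why the 150× blow-up surprises me.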

I also see this behaviour with the llama tokenizer. Above some input size, the memory usage is enough that the tokenize phase starts to touch swap, and the whole thing slows to a crawl.

(Edit: Tokenizing 1.9 GB was still going when I killed the process after roughly 6 hours. Tokenizing a quarter of that, 0.5 GB, takes just over 5 minutes.)
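For what it's worth, here's a back-of-envelope guess (pure speculation, not a profile of the actual tool) at where some of the overhead could come from: if every token id ends up as a boxed Python int inside a plain list, the per-token cost is an order of magnitude larger than the 2-4 bytes you'd naively expect.

```python
import sys

# Speculative estimate: cost of holding all token ids as a Python list of ints.
per_int = sys.getsizeof(12345)  # boxed int object, typically 28 bytes on 64-bit CPython
per_slot = 8                    # pointer per list element

tokens = 1.9e9 / 4              # very rough: ~4 bytes of text per token
gb = tokens * (per_int + per_slot) / 1e9
print(f"~{gb:.0f} GB just for a flat Python list of token ids")
```

That's still nowhere near 288 GB on its own, but combined with intermediate per-token strings, offsets, or multiple copies of the data, it could compound quickly.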

Any ideas? Thanks.

Looking at this again, it seems the model is loaded before the input is tokenized, which may explain why memory use explodes.

I guess one workaround would be to run first with a tiny throwaway model, then do a second run with the "actual" model, which would reuse the cached tokenizer results from the first run.

Got to be a better way, right?