Tokenizer.train() running out of memory

I am trying to train a tokenizer using the following code:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>", "<pad>", "</s>", "<unk>", "<mask>",
])

# Save files to disk
tokenizer.save_model(".")

It works fine with a small dataset, but when I use my full dataset it fails with the following error:

[00:00:00] Pre-processing files (485 Mo) β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 100%
memory allocation of 21474836480 bytes failed

My system has 16 GB of RAM, and the failed allocation alone is about 21 GB. Is there a way around this issue other than upgrading my RAM? I haven't been able to find a solution online. Thanks!