Tokenizer.train() running out of memory

I am trying to train a tokenizer using the following code:

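from tokenizers import ByteLevelBPETokenizer

# `paths` holds the path(s) to my training text files
tokenizer = ByteLevelBPETokenizer()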

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save_model("tokenizer")

It works fine on a small dataset, but on my full dataset it fails with the following error:

[00:00:00] Pre-processing files (485 Mo) ████████████████████████████████████████████████ 100%
memory allocation of 21474836480 bytes failed
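
For scale, here is a quick conversion of the byte count in that message. The variable names are only for illustration, and the comparison assumes the 16 GB mentioned in this post can be treated as 16 GiB:

```python
# Size of the allocation that failed, copied from the error message above.
failed_bytes = 21_474_836_480

# Convert to GiB (1 GiB = 2**30 bytes).
failed_gib = failed_bytes / 2**30

# Installed RAM; treating "16 GB" as 16 GiB is an assumption.
ram_gib = 16

print(f"requested: {failed_gib:.0f} GiB, installed: {ram_gib} GiB")
print("fits in RAM:", failed_gib <= ram_gib)
```

So the single allocation the trainer asks for is 20 GiB, which is already larger than the machine's total memory even though the input files are only 485 MB.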

My system has 16 GB of RAM. Is there a way around this issue other than upgrading the RAM? I haven't found any solutions online. Thanks.