I am trying to train a tokenizer using the following code:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train on my text files (paths is a list of file paths)
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save the vocab and merges files to disk
tokenizer.save_model("tokenizer")
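For reference, paths is just a list of my plain-text training files, collected roughly like this (the directory name and extension are placeholders for my actual data):

from pathlib import Path

# Gather every .txt file under the data directory (illustrative only)
paths = [str(p) for p in Path("data").glob("**/*.txt")]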
This works fine with a small dataset, but when I use my full dataset it fails with the following error:
[00:00:00] Pre-processing files (485 Mo) ████████████████████████████████████████████████ 100%
memory allocation of 21474836480 bytes failed
My system has 16 GB of RAM. Is there a way around this issue that isn't upgrading the RAM? I'm not finding solutions online. Thanks