How much memory is needed to train a ByteLevelBPETokenizer?

Hi, I’m trying to train a language model (LM) for Japanese from scratch.
To be honest, I copied almost everything from https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb.
I just swapped the Esperanto dataset for a Japanese Wikipedia dataset.

But when I tried to train the tokenizer, my notebook crashed and restarted, probably because it ran out of memory. My dataset is the entire Japanese Wikipedia text, which is 5.1 GB, and my server has 64 GB of memory.
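For reference, this is roughly the tokenizer training step I'm running, adapted from the notebook (the glob path and output directory are placeholders; the vocab size and special tokens are the notebook's settings, nothing Japanese-specific):

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Collect the plain-text files of the corpus ("./data" is a placeholder;
# my actual corpus is one large wiki text file).
paths = [str(p) for p in Path("./data").glob("**/*.txt")]

tokenizer = ByteLevelBPETokenizer()

# Same settings as in the Esperanto notebook.
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json and merges.txt for the LM training step
# (the output directory must already exist).
tokenizer.save_model("./tokenizer-ja")
```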

How much memory do I need to train the tokenizer from scratch?
Or can I prevent the out-of-memory error with some option?

Thanks in advance.


Pinging @anthony

[UPDATE] I tried subsets of the entire dataset:

66,101,891 lines, 5.1 GB (entire dataset): out of memory
15,000,000 lines: out of memory
3,000,000 lines, 231 MB: success

Can I train the tokenizer on a subset and then use it to train the LM on the entire dataset?
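In case it matters for the answer, this is roughly how I'm building the subsets, just taking the first N lines of the wiki text (file names are placeholders):

```python
from itertools import islice

N_LINES = 3_000_000  # the largest subset that trained successfully for me

# "wiki_ja.txt" / "wiki_ja_subset.txt" are placeholder file names.
with open("wiki_ja.txt", encoding="utf-8") as src, \
        open("wiki_ja_subset.txt", "w", encoding="utf-8") as dst:
    for line in islice(src, N_LINES):
        dst.write(line)
```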

I was able to train a tokenizer with train_bert_wordpiece.py without errors.
So it may be Jupyter notebooks, or something else, that caused the out-of-memory errors.
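To rule Jupyter in or out, I'll try running the same ByteLevelBPE training as a plain standalone script. A minimal sketch, assuming the notebook's settings (the corpus path and output directory are placeholders):

```python
# train_tokenizer.py -- run with: python train_tokenizer.py
from tokenizers import ByteLevelBPETokenizer


def main():
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["wiki_ja.txt"],  # placeholder path to the full corpus
        vocab_size=52_000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    tokenizer.save_model("./tokenizer-ja")  # directory must already exist


if __name__ == "__main__":
    main()
```

If this also gets killed on the full 5.1 GB file, then the notebook isn't the problem and it really is the tokenizer training itself.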