Hi, I'm trying to train a language model for Japanese from scratch.
To be honest, I copied almost everything from https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb.
I just changed the dataset from the Esperanto corpus to a Japanese Wikipedia dump.
But when I tried to train the tokenizer, my notebook crashed and restarted, probably because it ran out of memory. My dataset is the full Japanese Wikipedia text, about 5.1 GB, but my server has 64 GB of RAM.
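For reference, the tokenizer-training cell is essentially the one from the notebook; the only thing I changed is the input path (`wiki_ja.txt` below is a stand-in for my actual file):

```python
from tokenizers import ByteLevelBPETokenizer

# Same settings as the Esperanto notebook; only the input file differs.
# "wiki_ja.txt" stands in for my local Japanese Wikipedia text (~5.1 GB).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["wiki_ja.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```

The crash happens while this cell is running.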
How much memory do I need to train a tokenizer from scratch? Or are there options that would prevent the out-of-memory error?
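For example, I wondered whether streaming the corpus line by line via `train_from_iterator` (instead of handing over the whole file) would change the peak memory. A rough, untested sketch of what I mean:

```python
from tokenizers import ByteLevelBPETokenizer

def line_iterator(path):
    # Yield one line at a time so Python never holds
    # the whole 5.1 GB file in memory at once.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

# "wiki_ja.txt" is again a stand-in for my actual file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    line_iterator("wiki_ja.txt"),
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```

I don't know whether this actually reduces peak memory (the trainer may still accumulate everything internally), so any advice would be welcome.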
Thanks in advance.