How much memory is needed to train a ByteLevelBPETokenizer?

Hi, I’m trying to train a language model (LM) for Japanese from scratch.
To be honest, I copied almost everything from https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb.
I just swapped the Esperanto dataset for a Japanese Wikipedia dataset.

But when I tried to train the tokenizer, my notebook crashed and restarted, probably because it ran out of memory. My dataset is the entire Japanese Wikipedia text, which is 5.1 GB, and my server has 64 GB of memory.
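For reference, this is roughly the tokenizer training step I'm running, adapted from the notebook (the glob path and output directory are placeholders; the vocab size and special tokens are the notebook's settings, nothing Japanese-specific):

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Collect the plain-text files of the corpus ("./data" is a placeholder;
# my actual corpus is one large wiki text file).
paths = [str(p) for p in Path("./data").glob("**/*.txt")]

tokenizer = ByteLevelBPETokenizer()

# Same settings as in the Esperanto notebook.
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json and merges.txt for the LM training step
# (the output directory must already exist).
tokenizer.save_model("./tokenizer-ja")
```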

How much memory do I need to train the tokenizer from scratch?
Or can I prevent the out-of-memory error with some option?

Thanks in advance.


Pinging @anthony

[UPDATE] I tried subsets of the entire dataset:

66,101,891 lines, 5.1 GB (entire dataset): out of memory
15,000,000 lines: out of memory
3,000,000 lines, 231 MB: success

Can I train the tokenizer on a subset and then use it to train the LM on the entire dataset?
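In case it matters for the answer, this is roughly how I'm building the subsets, just taking the first N lines of the wiki text (file names are placeholders):

```python
from itertools import islice

N_LINES = 3_000_000  # the largest subset that trained successfully for me

# "wiki_ja.txt" / "wiki_ja_subset.txt" are placeholder file names.
with open("wiki_ja.txt", encoding="utf-8") as src, \
        open("wiki_ja_subset.txt", "w", encoding="utf-8") as dst:
    for line in islice(src, N_LINES):
        dst.write(line)
```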

I was able to train a tokenizer with train_bert_wordpiece.py without errors.
So it may be Jupyter notebooks, or something else, that caused the out-of-memory errors.
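To rule Jupyter in or out, I'll try running the same ByteLevelBPE training as a plain standalone script. A minimal sketch, assuming the notebook's settings (the corpus path and output directory are placeholders):

```python
# train_tokenizer.py -- run with: python train_tokenizer.py
from tokenizers import ByteLevelBPETokenizer


def main():
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["wiki_ja.txt"],  # placeholder path to the full corpus
        vocab_size=52_000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    tokenizer.save_model("./tokenizer-ja")  # directory must already exist


if __name__ == "__main__":
    main()
```

If this also gets killed on the full 5.1 GB file, then the notebook isn't the problem and it really is the tokenizer training itself.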