According to Hugging Face, the model was pre-trained on 16 TPU chips in total with a batch size of 256, and the sequence length was limited to 128 tokens for 90% of the iterations and 512 tokens for the remaining 10%.
I have a few questions:
- Are 16 TPU chips roughly comparable to 16 GPUs?
- Is the batch size of 256 the total (global) batch size across all TPU chips, or per chip?
- Does “sequence length was limited to 128 for 90% and 512 for 10% of iterations” mean that the position-related weights are reset after the first 90% of iterations? I sketched my guess of the schedule below.
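
For reference, here is how I imagine the sequence-length switch works (a minimal sketch, not Hugging Face's actual code; the 1,000,000 step count is the figure reported in the BERT paper, and the helper names are my own):

```python
# Sketch of my understanding of the two-phase sequence-length schedule.
TOTAL_STEPS = 1_000_000                 # total pre-training steps (BERT paper)
PHASE1_FRACTION = 0.9                   # 90% of steps at sequence length 128
phase1_steps = int(TOTAL_STEPS * PHASE1_FRACTION)

def max_seq_len_for(step: int) -> int:
    """Maximum sequence length used at a given training step."""
    return 128 if step < phase1_steps else 512

# My guess: the position-embedding table (size 512) is shared across both
# phases; phase 1 simply never indexes positions 128..511, so nothing would
# need to be reset at the switch. This is exactly what I'd like confirmed.
for step in (0, phase1_steps - 1, phase1_steps, TOTAL_STEPS - 1):
    print(f"step {step:>9,}: max_seq_len = {max_seq_len_for(step)}")
```

Is this switching of the maximum sequence length all that happens at the 90% mark, or is something actually reinitialized?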