According to Hugging Face, the model was pre-trained on 16 TPU chips in total with a batch size of 256, and the sequence length was limited to 128 tokens for 90% of the iterations and 512 tokens for the remaining 10%.
I have a few questions:
- Are 16 TPU chips roughly comparable to 16 GPUs?
- Is the batch size of 256 the total (global) batch size across all TPU chips, or per chip?
- Does “sequence length was limited to 128 for 90% and 512 for 10% of iterations” mean that the position-related weights are reset after the first 90% of iterations? I sketched my guess of the schedule below.
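
For reference, here is how I imagine the sequence-length switch works (a minimal sketch, not Hugging Face's actual code; the 1,000,000 step count is the figure reported in the BERT paper, and the helper names are my own):

```python
# Sketch of my understanding of the two-phase sequence-length schedule.
TOTAL_STEPS = 1_000_000                 # total pre-training steps (BERT paper)
PHASE1_FRACTION = 0.9                   # 90% of steps at sequence length 128
phase1_steps = int(TOTAL_STEPS * PHASE1_FRACTION)

def max_seq_len_for(step: int) -> int:
    """Maximum sequence length used at a given training step."""
    return 128 if step < phase1_steps else 512

# My guess: the position-embedding table (size 512) is shared across both
# phases; phase 1 simply never indexes positions 128..511, so nothing would
# need to be reset at the switch. This is exactly what I'd like confirmed.
for step in (0, phase1_steps - 1, phase1_steps, TOTAL_STEPS - 1):
    print(f"step {step:>9,}: max_seq_len = {max_seq_len_for(step)}")
```

Is this switching of the maximum sequence length all that happens at the 90% mark, or is something actually reinitialized?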