About pre-training the bert-base-cased model

According to the Hugging Face model card, the model was pre-trained on 16 TPU chips in total with a batch size of 256, and the sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%.

I have a few questions:

  1. Are 16 TPU chips roughly comparable to 16 GPUs?
  2. Is the batch size of 256 the total (global) batch size across all the TPUs, rather than per chip?
  3. Does "sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%" mean that the weights related to sequence length (e.g., position embeddings) are reset after 90% of the iterations? See the sketch below for how I currently picture it.
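Here is a minimal sketch of how I currently read the description, just to make questions 2 and 3 concrete. All the names and the step count are my own illustration, not the actual BERT or Hugging Face training code, so please correct me if this picture is wrong.

```python
# My own illustration of the setup described in the model card, not real training code.

NUM_DEVICES = 16            # 16 TPU chips -- question 1: comparable to 16 GPUs?
GLOBAL_BATCH_SIZE = 256     # question 2: is this the total across all chips?
PER_DEVICE_BATCH = GLOBAL_BATCH_SIZE // NUM_DEVICES   # -> 16 sequences per chip

TOTAL_STEPS = 1_000_000     # placeholder step count, only used for the 90/10 split below

def max_seq_len(step: int) -> int:
    """Question 3: my reading is that the same weights keep training and only the
    data pipeline switches to longer sequences for the last 10% of the steps,
    i.e. nothing is reset at the 90% boundary."""
    return 128 if step < int(0.9 * TOTAL_STEPS) else 512

print(PER_DEVICE_BATCH)              # 16
print(max_seq_len(0))                # 128 (first 90% of steps)
print(max_seq_len(TOTAL_STEPS - 1))  # 512 (remaining 10% of steps)
```

Is this interpretation correct, or does the schedule (or the per-chip batch size) work differently?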