Hello again,
I am fairly sure you are right, and you would have to train a model from scratch if you want to alter the layer size.
I believe you could increase the width of the model by using more attention heads in each block, by increasing the hidden size, or both. For example, bert-large is 24-layer, 1024-hidden, 16-heads, 340M parameters, whereas bert-base is 12-layer, 768-hidden, 12-heads, 110M parameters.
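If it helps, here is a minimal sketch (using the Hugging Face transformers library, assuming that is what you are working with) of building a BERT with a custom width. The sizes below are made up for illustration; the resulting weights are randomly initialized, so the model would still need to be pre-trained from scratch:

```python
from transformers import BertConfig, BertModel

# Hypothetical widened configuration: hidden_size must be divisible
# by num_attention_heads, and intermediate_size is conventionally
# 4 * hidden_size.
config = BertConfig(
    num_hidden_layers=12,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
)

# Freshly (randomly) initialized weights -- no pre-trained checkpoint
# matches this shape, so training from scratch is required.
model = BertModel(config)
```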
I think the hidden size corresponds to the number of real numbers used to represent each token, so if you changed it you would also need to train a new embedding layer.
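As a quick check on that point, the word-embedding matrix in transformers has shape (vocab_size, hidden_size), so its dimensions follow whatever hidden size you pick (again just a sketch with made-up numbers):

```python
from transformers import BertConfig, BertModel

config = BertConfig(hidden_size=1024, num_attention_heads=16)
model = BertModel(config)

# One hidden_size-dimensional vector per vocabulary token, so the
# embedding matrix changes shape (and needs retraining) whenever
# hidden_size changes.
print(model.embeddings.word_embeddings.weight.shape)
# -> torch.Size([30522, 1024])  (30522 is the default vocab_size)
```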