Why are segment and position embeddings so large?

Thanks for your reply! I read through the reading group’s thread as well as the Linformer paper. From what I understand, the biggest problem with projections into large spaces is speed, whereas large, random initialisations perform well out of the box. One would guess, then, that the middle ground is training smaller-dimensional feature spaces, giving a balanced trade-off between speed and performance.

However, there is still a big difference between the two examples I mention in how the embedding size relates to the input. So let’s assume we have a feature with two possible values (e.g. segment IDs, 0 or 1). In onmt this would be encoded (by default) in a space with two keys and one dimension. In BERT, though, it is much larger: still two keys, but 512 dimensions. What I am interested in is not only the difference between having 1 dimension vs 512, but also how this is motivated in BERT. In BERT (and its siblings) there is no constraint relating the number of input values of an embedding to its dimensionality: 30k vocabulary items, 512 positions, 2 segments, and all of them get the same number of dimensions so they can be summed (see the sketch below).

I still have not seen any evaluation of the research question this comes down to: should the dimensionality of an embedding space be determined by the number of its keys? The problem with evaluating this, I think, is that in language models these spaces are not trained separately but as part of the whole model, so it is hard to make statements about the embeddings themselves.
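To make the summing scheme concrete, here is a minimal sketch in PyTorch (module and parameter names are my own, not from any library); it just shows that because token, position and segment embeddings all share the same dimensionality, they can be added elementwise regardless of how many distinct keys each one has:

```python
import torch
import torch.nn as nn

class SummedEmbeddings(nn.Module):
    """BERT-style input embeddings: every feature gets the same dimension."""
    def __init__(self, vocab_size=30000, max_positions=512, num_segments=2, dim=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)        # 30k keys -> dim
        self.position = nn.Embedding(max_positions, dim)  # 512 keys -> dim
        self.segment = nn.Embedding(num_segments, dim)    # 2 keys   -> dim

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # All three spaces share the same dimensionality, so they can be summed.
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)

emb = SummedEmbeddings()
tokens = torch.randint(0, 30000, (1, 16))
segments = torch.zeros(1, 16, dtype=torch.long)
print(emb(tokens, segments).shape)  # torch.Size([1, 16, 512])
```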

As an update on my own research: we found that a 4-value, 6-dimensional feature embedding concatenated to a 506-dimensional token embedding performs better than summing a 4-value, 512-dimensional feature embedding with a 512-dimensional token representation.
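For comparison, a sketch of that concatenation variant (again with hypothetical names and the dimensions from our setup, assumed here for illustration): the feature space is sized roughly to its number of keys, and the token embedding is narrowed so the total stays at 512:

```python
import torch
import torch.nn as nn

class ConcatEmbeddings(nn.Module):
    """Small feature embedding concatenated to a narrower token embedding."""
    def __init__(self, vocab_size=30000, num_feature_values=4,
                 token_dim=506, feature_dim=6):
        super().__init__()
        self.token = nn.Embedding(vocab_size, token_dim)              # 30k keys -> 506
        self.feature = nn.Embedding(num_feature_values, feature_dim)  # 4 keys   -> 6

    def forward(self, token_ids, feature_ids):
        # Concatenate along the last dimension: 506 + 6 = 512.
        return torch.cat([self.token(token_ids), self.feature(feature_ids)], dim=-1)

emb = ConcatEmbeddings()
tokens = torch.randint(0, 30000, (1, 16))
features = torch.randint(0, 4, (1, 16))
print(emb(tokens, features).shape)  # torch.Size([1, 16, 512])
```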
