Why are segment and position embeddings so large?

Thanks for your reply! I read through the reading group’s thread as well as the Linformer paper. From what I understand, the biggest problem with projections into large spaces is speed, whereas large, random initialisations perform well out of the box. One would guess, then, that the middle ground is training smaller-dimensional feature spaces, giving a balanced trade-off between speed and performance.

However, there is still a big difference between the two examples I mention in how the embedding size relates to the input. So let’s assume we have a feature with two possible values (e.g. segment IDs, 0 or 1). In onmt this would be encoded (by default) in a space with two keys and one dimension. In BERT, though, it is much larger: still two keys, but 512 dimensions. What I am interested in is not only the difference between having 1 dimension vs 512, but also how this is motivated in BERT. In BERT (and its siblings) there is no constraint relating the number of input values of an embedding to its dimensionality: 30k vocabulary items, 512 positions, 2 segments, and all of them get the same number of dimensions so they can be summed (see the sketch below).

I still have not seen any evaluation of the research question this comes down to: should the dimensionality of an embedding space be determined by the number of its keys? The problem with evaluating this, I think, is that in language models these spaces are not trained separately but as part of the whole model, so it is hard to make statements about the embeddings themselves.
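To make the summing scheme concrete, here is a minimal sketch in PyTorch (module and parameter names are my own, not from any library); it just shows that because token, position and segment embeddings all share the same dimensionality, they can be added elementwise regardless of how many distinct keys each one has:

```python
import torch
import torch.nn as nn

class SummedEmbeddings(nn.Module):
    """BERT-style input embeddings: every feature gets the same dimension."""
    def __init__(self, vocab_size=30000, max_positions=512, num_segments=2, dim=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)        # 30k keys -> dim
        self.position = nn.Embedding(max_positions, dim)  # 512 keys -> dim
        self.segment = nn.Embedding(num_segments, dim)    # 2 keys   -> dim

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # All three spaces share the same dimensionality, so they can be summed.
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)

emb = SummedEmbeddings()
tokens = torch.randint(0, 30000, (1, 16))
segments = torch.zeros(1, 16, dtype=torch.long)
print(emb(tokens, segments).shape)  # torch.Size([1, 16, 512])
```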

As an update on my own research: we found that a 4-value, 6-dimensional feature embedding concatenated to a 506-dimensional token embedding performs better than summing a 4-value, 512-dimensional feature embedding with a 512-dimensional token representation.
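For comparison, a sketch of that concatenation variant (again with hypothetical names and the dimensions from our setup, assumed here for illustration): the feature space is sized roughly to its number of keys, and the token embedding is narrowed so the total stays at 512:

```python
import torch
import torch.nn as nn

class ConcatEmbeddings(nn.Module):
    """Small feature embedding concatenated to a narrower token embedding."""
    def __init__(self, vocab_size=30000, num_feature_values=4,
                 token_dim=506, feature_dim=6):
        super().__init__()
        self.token = nn.Embedding(vocab_size, token_dim)              # 30k keys -> 506
        self.feature = nn.Embedding(num_feature_values, feature_dim)  # 4 keys   -> 6

    def forward(self, token_ids, feature_ids):
        # Concatenate along the last dimension: 506 + 6 = 512.
        return torch.cat([self.token(token_ids), self.feature(feature_ids)], dim=-1)

emb = ConcatEmbeddings()
tokens = torch.randint(0, 30000, (1, 16))
features = torch.randint(0, 4, (1, 16))
print(emb(tokens, features).shape)  # torch.Size([1, 16, 512])
```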
