Why are segment and position embeddings so large?

Cross-post from: https://forum.opennmt.net/t/size-of-feature-embeddings/3836

I am currently working part-time on improving translation models. We use regular transformer seq2seq networks in OpenNMT. This question is not about OpenNMT itself, but it was triggered by reading its documentation. In onmt you can attach features to each word, and each feature is then used to train its own embedding. For example, if you want to train a lower-case model but still want to keep casing information, you can add a casing feature that indicates whether the word was lower case or not.

i│C like│l cookies│l from│l new│C york│C

This will create two embedding layers under the hood: one for the tokens and one for the case features.

In their documentation, they state that the default size for features is

… set to N^feat_vec_exponent where N is the number of values the feature takes.

where the default feat_vec_exponent value is 0.7.
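As a quick check of what this formula gives in practice, here is a minimal sketch that evaluates N^0.7 for a few values of N. The function name and the choice of N values are illustrative only, and how the library rounds the result to an integer is not stated in the quote above, so both the floor and the ceiling are shown.

```python
import math


def feature_embedding_size(num_values: int, exponent: float = 0.7) -> float:
    """Raw value of N ** exponent, before any rounding (hypothetical helper)."""
    return num_values ** exponent


# Illustrative N values: a binary feature, a 4-value feature,
# a 512-value feature, and a 30k vocabulary.
for n in (2, 4, 512, 30_000):
    raw = feature_embedding_size(n)
    print(f"N={n:>6}: N^0.7 = {raw:8.2f} (floor {math.floor(raw)}, ceil {math.ceil(raw)})")
```

Under this rule the size grows sublinearly with the number of values, so a small categorical feature ends up with only a handful of dimensions.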

However, that means a feature with only two values gets an embedding of size 1 or 2 (2^0.7 ≈ 1.6). The embeddings (token and casing) are then concatenated. This contrasts sharply with the language models that I know. Take BERT, for instance, which has token (30k values), segment (two values), and position (512 values) embeddings that all share the same dimensionality, namely the model's hidden size (768 in BERT-base), even the segment embeddings. These embeddings are summed.
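The difference between the two merge strategies can be sketched with plain NumPy lookup tables. This is only an illustration with toy sizes and randomly initialised tables; the variable names are made up for the example and do not correspond to OpenNMT or BERT internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for readability, not real model sizes.
vocab_size, num_case_values = 10, 2
token_dim, case_dim = 8, 2

token_ids = np.array([3, 7, 1])  # three tokens
case_ids = np.array([1, 0, 0])   # e.g. 1 = capitalised, 0 = lower case

token_table = rng.normal(size=(vocab_size, token_dim))

# Concatenation: each table can have its own width, and the widths add up.
case_table_small = rng.normal(size=(num_case_values, case_dim))
concatenated = np.concatenate(
    [token_table[token_ids], case_table_small[case_ids]], axis=-1
)

# Summation: every table must have the same width as the token table.
case_table_wide = rng.normal(size=(num_case_values, token_dim))
summed = token_table[token_ids] + case_table_wide[case_ids]

print(concatenated.shape)  # (3, 10)
print(summed.shape)        # (3, 8)
```

With concatenation the feature width is a free choice, whereas summation forces a two-value feature into the full token dimensionality.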

My question thus comes down to this: I always thought that the number of items in an embedding should more or less dictate the dimensionality of that embedding (as onmt suggests), but BERT and its siblings do not do this. So which approach is best, and why? How can a feature with only two values make sense in a space with hundreds of dimensions?


It’s actually more a question of projecting into a dense, high-dimensional vector space versus a sparse space than a question of the dimensionality itself.

A lot of the recent developments in NLP are about projecting labels and tabular data into a high-dimensional vector space (assigning learned vectors to sparse categorical features) prior to computation.

One striking demonstration of the efficiency of projecting into high dimensions is the work of John Wieting and Douwe Kiela: https://openreview.net/forum?id=BkgPajAcY7 There is also a much older history of work on random projections and the Johnson-Lindenstrauss lemma: https://scikit-learn.org/stable/modules/random_projection.html A related discussion on the JL lemma that you may want to join is here: https://github.com/huggingface/awesome-papers/discussions/7
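To make the random-projection idea concrete, here is a small sketch in plain NumPy (not the scikit-learn implementation linked above). It projects random points through a scaled Gaussian matrix and compares pairwise distances before and after. All sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 20 points, projected from 1000 down to 300 dimensions.
n_points, high_dim, low_dim = 20, 1000, 300
points = rng.normal(size=(n_points, high_dim))

# Gaussian random projection, scaled so that squared norms are
# preserved in expectation.
projection = rng.normal(size=(high_dim, low_dim)) / np.sqrt(low_dim)
projected = points @ projection


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Euclidean distances between all distinct pairs of rows."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(x), k=1)
    return dists[upper]


ratios = pairwise_distances(projected) / pairwise_distances(points)
print(f"distance ratio after projection: min={ratios.min():.3f}, max={ratios.max():.3f}")
```

The ratios stay close to 1 even though the projection matrix is never trained, which is the kind of behaviour the JL lemma describes.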

Note, however, that there is a limit to the optimal dimension for the input embedding: recent models like ALBERT (https://openreview.net/forum?id=H1eA7AEtvS) or approaches like adaptive inputs (http://arxiv.org/abs/1809.10853) keep the input dimension smaller than the model's hidden size to reach a better ratio between the two dimensions.
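The effect of a smaller input dimension on parameter count can be sketched with simple arithmetic. The helpers below are hypothetical and ignore biases; the sizes (a 30k vocabulary, a 128-dimensional input embedding, a 768-dimensional hidden size) are illustrative values at roughly ALBERT-base scale, and the second helper assumes the small embedding is mapped to the hidden size by a single linear projection.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters of one embedding table that maps directly to the hidden size."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embed_size: int, hidden_size: int) -> int:
    """Parameters of a small embedding table plus a projection up to the hidden size."""
    return vocab_size * embed_size + embed_size * hidden_size


V, E, H = 30_000, 128, 768  # illustrative sizes

full = embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)

print(full)        # 23040000
print(factorized)  # 3938304
```

In this setting the factorized version needs roughly a sixth of the parameters, which is one reason to decouple the input dimension from the hidden size.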


Thanks for your reply! I read through the reading group’s thread as well as the Linformer paper. From what I understand, the biggest problem with projections into large spaces is speed. On the other hand, large random initialisations perform well out of the box. One would guess, then, that the middle ground is to find trained, lower-dimensional feature spaces that give a balanced trade-off between speed and performance.

However, there is still a big difference in size relative to the input between the two examples that I mentioned. Let’s assume we have a feature with two possible values (e.g. segment IDs, 0 or 1). In onmt this would be encoded (by default) as two values in a single dimension. In BERT, though, the space is much larger: two values, but hundreds of dimensions (768 in BERT-base). What I am interested in is not only the difference between having one dimension and several hundred, but also how this choice is motivated in BERT. In BERT (and its siblings) there is no constraint linking the number of entries in an embedding to its dimensionality: the 30k-token vocabulary, the 512 positions, and the 2 segments all get the same dimensionality so that they can be summed. I still have not seen any evaluation of the research question this comes down to: is, or should, the quality of a vector space be determined by the number of its keys? The difficulty in evaluating this, I think, is that in language models these spaces are not trained separately but as part of the whole model. It is therefore hard to make statements about the embeddings themselves.

As an update on my own research: we found that a 4-value, 6-dimensional feature embedding concatenated to a 506-dimensional token embedding performs better than a 4-value, 512-dimensional feature embedding summed with a 512-dimensional token representation.
