It’s actually more a question of projecting into a high-dimensional dense vector space versus a sparse space, rather than the dimensionality itself.
A lot of the recent developments in NLP are about projecting labels and tabular data into a high-dimensional vector space (assigning learned vectors to sparse categorical features) prior to computation.
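As a minimal sketch of what that looks like in practice (using PyTorch's `nn.Embedding` here as one common way to do it; the sizes are arbitrary, not from any specific model):

```python
import torch
import torch.nn as nn

# Hypothetical example: a sparse categorical feature with 10k possible
# values (word ids, a tabular column, ...) mapped to dense 128-d vectors.
vocab_size, embed_dim = 10_000, 128
embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of sparse indices becomes a batch of dense, learnable vectors.
ids = torch.tensor([[3, 42, 7], [1, 9_999, 0]])  # shape (2, 3)
dense = embedding(ids)                           # shape (2, 3, 128)
```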
One striking demonstration of how effective projecting into high dimensions can be is the work of John Wieting and Douwe Kiela: https://openreview.net/forum?id=BkgPajAcY7, but there is also a much older line of work on random projections and the Johnson-Lindenstrauss lemma: https://scikit-learn.org/stable/modules/random_projection.html. A related discussion on the JL lemma you may want to join is here: https://github.com/huggingface/awesome-papers/discussions/7
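The scikit-learn module linked above makes the JL lemma easy to play with; a quick self-contained illustration (the sample count and input dimension below are arbitrary):

```python
import numpy as np
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# JL lemma: how many dimensions suffice to preserve the pairwise
# distances of 1,000 points within 10% distortion?
n_samples = 1_000
k = johnson_lindenstrauss_min_dim(n_samples, eps=0.1)
print(k)  # a few thousand, independent of the original dimensionality

# Project high-dimensional data down with a random (untrained) matrix.
X = np.random.rand(n_samples, 20_000)
X_proj = GaussianRandomProjection(n_components=k).fit_transform(X)
print(X_proj.shape)  # (1000, k)
```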
Note, however, that there is a limit to the optimal dimension for the input embedding: recent models like ALBERT (https://openreview.net/forum?id=H1eA7AEtvS) or approaches like Adaptive Inputs (http://arxiv.org/abs/1809.10853) keep the input embedding dimension smaller than the model's hidden size to reach a better ratio between these two dimensions.
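A hedged sketch of the ALBERT-style factorized embedding (again in PyTorch, with illustrative sizes): instead of a single V×H embedding matrix, the input is first embedded in a smaller dimension E and then projected up to the hidden size H, cutting the embedding parameters from V·H to V·E + E·H.

```python
import torch
import torch.nn as nn

# Illustrative sizes (ALBERT-base uses roughly V=30k, E=128, H=768).
V, E, H = 30_000, 128, 768

# Factorized embedding: a V x E lookup followed by an E x H projection,
# instead of one V x H matrix (~23M params -> ~3.8M + ~0.1M).
embed = nn.Embedding(V, E)             # V * E parameters
project = nn.Linear(E, H, bias=False)  # E * H parameters

ids = torch.randint(0, V, (2, 16))  # a batch of token ids
hidden = project(embed(ids))        # shape (2, 16, H)
```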