It’s actually more a question of projecting into a high-dimensional dense vector space versus a sparse space, rather than the dimensionality itself.
A lot of the recent developments in NLP are about projecting labels and tabular data into a high-dimensional vector space (assigning learned vectors to sparse categorical features) prior to computation.
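As a minimal sketch of what that looks like in practice (using PyTorch's `nn.Embedding` here as one common way to do it; the sizes are arbitrary, not from any specific model):

```python
import torch
import torch.nn as nn

# Hypothetical example: a sparse categorical feature with 10k possible
# values (word ids, a tabular column, ...) mapped to dense 128-d vectors.
vocab_size, embed_dim = 10_000, 128
embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of sparse indices becomes a batch of dense, learnable vectors.
ids = torch.tensor([[3, 42, 7], [1, 9_999, 0]])  # shape (2, 3)
dense = embedding(ids)                           # shape (2, 3, 128)
```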
One striking demonstration of how effective projecting into high dimensions can be is the work of John Wieting and Douwe Kiela: https://openreview.net/forum?id=BkgPajAcY7, but there is also a much older line of work on random projections and the Johnson-Lindenstrauss lemma: https://scikit-learn.org/stable/modules/random_projection.html. A related discussion on the JL lemma you may want to join is here: https://github.com/huggingface/awesome-papers/discussions/7
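The scikit-learn module linked above makes the JL lemma easy to play with; a quick self-contained illustration (the sample count and input dimension below are arbitrary):

```python
import numpy as np
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# JL lemma: how many dimensions suffice to preserve the pairwise
# distances of 1,000 points within 10% distortion?
n_samples = 1_000
k = johnson_lindenstrauss_min_dim(n_samples, eps=0.1)
print(k)  # a few thousand, independent of the original dimensionality

# Project high-dimensional data down with a random (untrained) matrix.
X = np.random.rand(n_samples, 20_000)
X_proj = GaussianRandomProjection(n_components=k).fit_transform(X)
print(X_proj.shape)  # (1000, k)
```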
Note, however, that there is a limit to the optimal dimension for the input embedding: recent models like ALBERT (https://openreview.net/forum?id=H1eA7AEtvS) or approaches like Adaptive Inputs (http://arxiv.org/abs/1809.10853) keep the input embedding dimension smaller than the model's hidden size to reach a better ratio between these two dimensions.
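A hedged sketch of the ALBERT-style factorized embedding (again in PyTorch, with illustrative sizes): instead of a single V×H embedding matrix, the input is first embedded in a smaller dimension E and then projected up to the hidden size H, cutting the embedding parameters from V·H to V·E + E·H.

```python
import torch
import torch.nn as nn

# Illustrative sizes (ALBERT-base uses roughly V=30k, E=128, H=768).
V, E, H = 30_000, 128, 768

# Factorized embedding: a V x E lookup followed by an E x H projection,
# instead of one V x H matrix (~23M params -> ~3.8M + ~0.1M).
embed = nn.Embedding(V, E)             # V * E parameters
project = nn.Linear(E, H, bias=False)  # E * H parameters

ids = torch.randint(0, V, (2, 16))  # a batch of token ids
hidden = project(embed(ids))        # shape (2, 16, H)
```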