Do We Still Need Dimensionality Reduction for LLM Text Embeddings?

Hello @phamnam,

I am fairly new to the world of NLP and even AI, so I apologize if my ideas are entirely ungrounded. Your findings were super interesting and I couldn’t help but want to discuss them :slight_smile:

Low Intrinsic Dimension

Perhaps the information stored in the embedding vectors resides in a low intrinsic dimension. In that case, there would be information overlap across the embedding model's dimensions, and truncation might work well because much of the information carried by the truncated dimensions is also present in the dimensions that remain.
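To make this concrete, here is a minimal sketch (using synthetic data, not real embeddings) of how you could probe intrinsic dimension: if 256-dimensional vectors are generated from only 16 latent factors, an SVD shows nearly all of the variance concentrated in the first 16 components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "embeddings": 1000 vectors in 256 dims, but generated from
# only 16 latent factors, so the intrinsic dimension is ~16.
latent = rng.normal(size=(1000, 16))
mixing = rng.normal(size=(16, 256))
embeddings = latent @ mixing

# The singular value spectrum reveals how many dimensions carry real variance.
singular_values = np.linalg.svd(embeddings, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Nearly all variance sits in the first 16 components.
print(np.sum(explained[:16]))
```

Real embedding models won't be this clean, but a similarly steep spectrum on actual embedding matrices would support the redundancy hypothesis.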

Vector Comparison Formula

Perhaps it also has to do with the formula used for vector comparison.

For example, one common metric used for vector comparison is cosine similarity, which has the following formula:

cos_sim(A, B) = dot_product(A, B) / (len(A) * len(B))

where len(A) is the magnitude (Euclidean norm) of A.

When you truncate a vector, you are impacting the formula in a few different ways.

  1. You are removing terms from dot_product(A, B), which usually decreases it (though removed terms can be negative)
  2. You are decreasing the magnitude len(A)
  3. You are decreasing the magnitude len(B)

Perhaps, since you are shrinking both the numerator and the denominator of the fraction, the cosine similarity after truncation ends up pretty close to what you would have gotten before truncation. Ultimately, this would lead to fairly similar search results.
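Here is a quick sketch of that intuition on synthetic vectors (not real model embeddings): truncating both vectors shrinks the numerator and denominator together, so the cosine similarity stays roughly the same.

```python
import numpy as np


def cosine_similarity(a, b):
    # dot_product(A, B) / (len(A) * len(B)), matching the formula above
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


rng = np.random.default_rng(42)
a = rng.normal(size=1024)
b = a + 0.5 * rng.normal(size=1024)  # a vector similar to a

full = cosine_similarity(a, b)            # all 1024 dims
truncated = cosine_similarity(a[:256], b[:256])  # keep only the first 256

# The two scores are typically close for vectors like these.
print(full, truncated)
```

Note the caveat: this works out nicely here because every coordinate contributes comparably. If a model concentrated its most discriminative information in the dimensions that get cut, truncation would hurt much more, which loops back to the intrinsic-dimension point above.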