Do We Still Need Dimensionality Reduction for LLM Text Embeddings?

Hello @phamnam,

I am fairly new to the world of NLP and even AI, so I apologize if my ideas are entirely ungrounded. Your findings were super interesting and I couldn’t help but want to discuss them :slight_smile:

Low Intrinsic Dimension

Perhaps the information stored in the embedding vectors resides in a low intrinsic dimension. In that case, there would be information overlap across the embedding model's dimensions, and truncation might work well because much of the information carried by the truncated dimensions is also present in the dimensions that remain.
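To make this concrete, here is a minimal sketch (using synthetic data, not real embeddings) of how you could probe intrinsic dimension: if 256-dimensional vectors are generated from only 16 latent factors, an SVD shows nearly all of the variance concentrated in the first 16 components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "embeddings": 1000 vectors in 256 dims, but generated from
# only 16 latent factors, so the intrinsic dimension is ~16.
latent = rng.normal(size=(1000, 16))
mixing = rng.normal(size=(16, 256))
embeddings = latent @ mixing

# The singular value spectrum reveals how many dimensions carry real variance.
singular_values = np.linalg.svd(embeddings, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Nearly all variance sits in the first 16 components.
print(np.sum(explained[:16]))
```

Real embedding models won't be this clean, but a similarly steep spectrum on actual embedding matrices would support the redundancy hypothesis.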

Vector Comparison Formula

Perhaps it also has to do with the formula used for vector comparison.

For example, one common metric used for vector comparison is cosine similarity, which has the following formula:

cos_sim(A, B) = dot_product(A, B) / (len(A) * len(B))

where len(A) is the magnitude (Euclidean norm) of A.

When you truncate a vector, you are impacting the formula in a few different ways.

  1. You are removing terms from dot_product(A, B), which usually decreases it (though removed terms can be negative)
  2. You are decreasing the magnitude len(A)
  3. You are decreasing the magnitude len(B)

Perhaps, since you are shrinking both the numerator and the denominator of the fraction, the cosine similarity after truncation ends up pretty close to what you would have gotten before truncation. Ultimately, this would lead to fairly similar search results.
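Here is a quick sketch of that intuition on synthetic vectors (not real model embeddings): truncating both vectors shrinks the numerator and denominator together, so the cosine similarity stays roughly the same.

```python
import numpy as np


def cosine_similarity(a, b):
    # dot_product(A, B) / (len(A) * len(B)), matching the formula above
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


rng = np.random.default_rng(42)
a = rng.normal(size=1024)
b = a + 0.5 * rng.normal(size=1024)  # a vector similar to a

full = cosine_similarity(a, b)            # all 1024 dims
truncated = cosine_similarity(a[:256], b[:256])  # keep only the first 256

# The two scores are typically close for vectors like these.
print(full, truncated)
```

Note the caveat: this works out nicely here because every coordinate contributes comparably. If a model concentrated its most discriminative information in the dimensions that get cut, truncation would hurt much more, which loops back to the intrinsic-dimension point above.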