Do We Still Need Dimensionality Reduction for LLM Text Embeddings?

The current MTEB Leaderboard is dominated by LLM-based text embedding models, demonstrating their effectiveness in this field. However, using these embeddings in real-world projects can be expensive due to their high dimensionality (often 4096, 3584, or even larger).

Recently, I’ve been experimenting with dimensionality reduction techniques for LLM text embeddings, motivated by the desire for greater efficiency. I explored methods inspired by two papers: “Matryoshka Representation Learning” and “Espresso Sentence Embeddings”.

However, I stumbled upon a surprising discovery due to a bug in my code. It turns out that simple truncation (or pruning) of the embedding vector based on position yields results comparable to using the full-size vector!

  • Truncation/pruning can be applied to select the first X dimensions, the last X dimensions, a segment from the middle, or even elements at arbitrary positions within the vector.
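
For concreteness, here is a minimal sketch of those pruning variants, assuming NumPy and a random vector as a stand-in for a real embedding (the 4096-dimension size and the 1024 kept dimensions are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=4096)   # stand-in for a real 4096-d LLM embedding
k = 1024                      # number of dimensions to keep (illustrative)

first_k  = emb[:k]                         # first k dimensions
last_k   = emb[-k:]                        # last k dimensions
middle_k = emb[1536:1536 + k]              # a contiguous middle segment
random_k = emb[rng.choice(emb.size, size=k, replace=False)]  # arbitrary positions

# Pruned vectors are often re-normalized before being used in a
# dot-product / cosine-similarity search index.
first_k = first_k / np.linalg.norm(first_k)
```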

I tested this approach with various models, including a Vistral text-embedding model (fine-tuned from Vistral 7B Chat), gte-qwen2-1.5b-instruct, and multilingual BERT, and all of them showed similar results.

This finding has left me bewildered. Why is this happening? Could it be that the information is so evenly distributed within the vector that truncation/pruning has little impact compared to the full-size representation?

Does this mean that sophisticated dimensionality reduction algorithms and techniques are no longer necessary?

I’m eager to hear your thoughts and insights on this unexpected observation. Please share your opinions in the comments!

Hello @phamnam,

I am fairly new to the world of NLP and even AI, so I apologize if my ideas are entirely ungrounded. Your findings were super interesting and I couldn’t help but want to discuss them 🙂

Low Intrinsic Dimension

Perhaps the information stored in the embedding vectors resides in a low-dimensional subspace, i.e., it has a low intrinsic dimension. In that case, there may be redundancy across the embedding dimensions, and truncation might work well because some of the information that was truncated is also encoded in the dimensions that remain.
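
If you have a sample of embeddings from one of these models, one way to probe this hypothesis is to check how many principal components are needed to explain most of their variance. A minimal sketch with scikit-learn (the random matrix below is only a placeholder for a real (n_sentences, n_dims) embedding matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1024))   # placeholder: replace with real embeddings

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k95 = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{k95} components explain 95% of the variance")
```

If that number is much smaller than the original dimensionality, it would support the idea that the embeddings occupy a low-dimensional subspace and that many coordinates are partially redundant.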

Cosine Similarity

Perhaps it also has to do with the formula used for vector comparison.

For example, one common metric used for vector comparison is cosine similarity, which has the following formula:

cos_sim(A, B) = dot_product(A, B) / (len(A) * len(B))

where len(A) and len(B) are the Euclidean norms (lengths) of A and B.

When you truncate a vector, you are impacting the formula in a few different ways.

  1. You are reducing the magnitude of dot_product(A, B) (on average), since fewer terms contribute to the sum
  2. You are decreasing len(A), the norm of A
  3. You are decreasing len(B), the norm of B

Perhaps, since you are shrinking both the numerator and the denominator of that fraction, you end up with a cosine similarity value that is quite close to what you would have gotten before truncation. Ultimately, this would lead to fairly similar search results.
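
As a quick sanity check of this reasoning, here is a minimal sketch that compares cosine similarity before and after truncation. Random vectors stand in for real embeddings, and the mixing step only exists to make the pair correlated, the way two related sentences would be:

```python
import numpy as np

def cos_sim(a, b):
    # dot_product(A, B) / (len(A) * len(B)), as in the formula above
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.normal(size=4096)
b = 0.7 * a + 0.3 * rng.normal(size=4096)   # correlated pair of "embeddings"

print("full 4096 dims :", cos_sim(a, b))
print("first 1024 dims:", cos_sim(a[:1024], b[:1024]))
print("last 1024 dims :", cos_sim(a[-1024:], b[-1024:]))
```

Because the dot product and both norms are computed over the same kept coordinates, the ratio tends to stay close to the full-dimensional value whenever the relevant signal is spread fairly evenly across positions, which would match the behavior you observed.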