Exploring Dual-Head Embeddings and Adaptive Compression Best Practices?

I’m experimenting with a dual-head embedding architecture (one semantic head for contextual meaning, one entity head for precise term resolution) and want to preserve semantic consistency after pruning or matryoshka-style compression.
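
For concreteness, here's a rough sketch of the kind of setup I mean (PyTorch; the encoder dim, head sizes, and truncation logic are illustrative placeholders, not my actual model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadEmbedder(nn.Module):
    """Shared backbone output feeding two projection heads."""

    def __init__(self, encoder_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.semantic_head = nn.Linear(encoder_dim, embed_dim)  # contextual meaning
        self.entity_head = nn.Linear(encoder_dim, embed_dim)    # precise term resolution

    def forward(self, hidden: torch.Tensor, trunc_dim: int | None = None):
        sem = F.normalize(self.semantic_head(hidden), dim=-1)
        ent = F.normalize(self.entity_head(hidden), dim=-1)
        if trunc_dim is not None:
            # Matryoshka-style compression: keep the leading dims, re-normalize.
            sem = F.normalize(sem[..., :trunc_dim], dim=-1)
            ent = F.normalize(ent[..., :trunc_dim], dim=-1)
        return sem, ent
```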

Are there evaluation metrics or validation strategies beyond cosine similarity that you’ve found reliable for detecting information loss in such setups? Any insights on training tricks (e.g., InfoNCE + VICReg blends or alternative regularizers) that help maintain performance across heads during compression would be greatly appreciated.
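
To make "beyond cosine similarity" concrete, one check I have in mind is top-k nearest-neighbor overlap between the full and truncated embeddings, since retrieval-relevant loss can hide behind high pointwise cosine scores. A minimal sketch (the choice of k and the Jaccard aggregation are arbitrary on my part):

```python
import torch

def topk_overlap(full: torch.Tensor, truncated: torch.Tensor, k: int = 10) -> float:
    """Mean Jaccard overlap of top-k neighbor sets before/after truncation.

    full, truncated: (n, d_full) and (n, d_small) L2-normalized embeddings
    of the same n items. A drop in overlap flags information loss that
    pointwise cosine similarity on individual pairs can miss.
    """
    def knn(x: torch.Tensor) -> torch.Tensor:
        sims = x @ x.T
        sims.fill_diagonal_(float("-inf"))  # exclude self-matches
        return sims.topk(k, dim=-1).indices

    a, b = knn(full), knn(truncated)
    overlaps = []
    for row_a, row_b in zip(a.tolist(), b.tolist()):
        sa, sb = set(row_a), set(row_b)
        overlaps.append(len(sa & sb) / len(sa | sb))
    return sum(overlaps) / len(overlaps)
```

Running this per head (and per truncation dim) would also show whether the entity head degrades faster than the semantic head, which aggregate cosine numbers tend to blur.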
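
And for reference, this is roughly what I mean by an InfoNCE + VICReg blend: symmetric in-batch InfoNCE for alignment per head, plus VICReg's variance/covariance terms to keep dimensions from collapsing or becoming redundant under pruning. The weights below are illustrative, not tuned:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07):
    # Symmetric InfoNCE over in-batch negatives; z1, z2 are L2-normalized.
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def vicreg_reg(z: torch.Tensor, var_weight=1.0, cov_weight=0.04, eps=1e-4):
    # Variance + covariance terms from VICReg; the invariance term is
    # effectively covered by InfoNCE here.
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std).mean()      # hold per-dim std near 1
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d     # decorrelate dimensions
    return var_weight * var_loss + cov_weight * cov_loss

def blended_loss(sem1, sem2, ent1, ent2, reg_weight: float = 1.0):
    # Contrastive alignment per head plus VICReg-style regularization,
    # so neither head collapses when trailing matryoshka dims are pruned.
    loss = info_nce(sem1, sem2) + info_nce(ent1, ent2)
    loss += reg_weight * (vicreg_reg(sem1) + vicreg_reg(ent1))
    return loss
```

Curious whether anyone has found a principled way to set the blend weights per truncation level rather than globally.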

