Constraining model capacity with a Matryoshka wrapper makes the modeling task more difficult, which in theory improves capability per parameter.
We are using a cross-encoder to train a Transformer model on the task of semantic similarity between health care concepts.
Has anyone explored the potential benefit of successive Matryoshka training runs, each using a progressively finer (less coarse) Matryoshka function?
The idea would be to use the initial runs to provide an improved underlying structure for the later runs to build upon. (This is similar to the way a sculptor works: first chiseling out a rough form, then honing the details.)
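
To make the question concrete, here is a rough sketch of the kind of schedule I have in mind. It rests on assumptions: I am taking "coarse" to mean a small set of widely spaced nested embedding sizes, with each later run filling in the intermediate sizes. The function names (`halving_ladder`, `coarse_to_fine_schedule`, `matryoshka_loss`) and the sizes 512 and 32 are hypothetical and only for illustration. The per-prefix loss is a placeholder callable, and averaging the per-prefix losses is just one possible weighting.

```python
from typing import Callable, List, Sequence


def halving_ladder(full_dim: int, min_dim: int) -> List[int]:
    """All nested sizes obtained by repeatedly halving full_dim, ascending."""
    dims = []
    d = full_dim
    while d >= min_dim:
        dims.append(d)
        d //= 2
    return sorted(dims)


def coarse_to_fine_schedule(full_dim: int, min_dim: int, num_stages: int) -> List[List[int]]:
    """One list of nested sizes per training run, from coarsest to finest.

    Each run keeps every `stride`-th size, counting down from the full size
    (so the full size is always trained), and the stride halves at each
    run until the last run uses the whole ladder.
    """
    ladder_desc = halving_ladder(full_dim, min_dim)[::-1]
    schedule = []
    for stage in range(num_stages):
        stride = 2 ** (num_stages - 1 - stage)
        schedule.append(sorted(ladder_desc[::stride]))
    return schedule


def matryoshka_loss(loss_at_dim: Callable[[int], float], dims: Sequence[int]) -> float:
    """Average of the task loss evaluated on each truncated embedding size.

    `loss_at_dim(d)` is a stand-in for "truncate the embedding to its first
    d dimensions and compute the task loss"; it is an assumption, not a real API.
    """
    return sum(loss_at_dim(d) for d in dims) / len(dims)


# Print the nested sizes each successive run would train on.
for run, dims in enumerate(coarse_to_fine_schedule(512, 32, 3)):
    print(f"run {run}: nested sizes {dims}")
```

Each run would start from the previous run's checkpoint and train with `matryoshka_loss` over that run's list of sizes, so the early runs shape the rough structure and the later runs refine it.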
Thank You