Successive Matryoshka training - Healthcare concepts

Constraining model capacity with a Matryoshka wrapper makes the modeling task more difficult, which should, in theory, improve capability per parameter.

We are using a cross-encoder to train a Transformer model on the task of semantic similarity of healthcare concepts.
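For context, here is a minimal sketch of that kind of setup using the Sentence Transformers `CrossEncoder` API. The model name and the toy concept pairs are placeholders, not our actual data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Toy stand-ins for healthcare concept pairs with similarity labels;
# the real training set goes here.
train_examples = [
    InputExample(texts=["myocardial infarction", "heart attack"], label=0.95),
    InputExample(texts=["myocardial infarction", "tension headache"], label=0.05),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# A single-output cross-encoder that scores the similarity of a concept pair.
model = CrossEncoder("distilroberta-base", num_labels=1)
model.fit(train_dataloader=loader, epochs=1, warmup_steps=0)
```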

Has anyone explored the potential benefit of successive Matryoshka training, where each run uses a progressively less coarse Matryoshka function?

The idea would be for the initial runs to provide an improved underlying structure for the later runs to build upon, much like the way a sculptor works: first chiseling out a rough form before honing the details.
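To make the proposal concrete, here is a hypothetical sketch of the staged schedule. Note that `MatryoshkaLoss` in Sentence Transformers is implemented for bi-encoders, so the sketch uses a `SentenceTransformer`; applying the same idea to our cross-encoder would require a custom truncation wrapper. The model, stage dims, base loss, and data are all illustrative assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim

train_examples = [
    InputExample(texts=["myocardial infarction", "heart attack"], label=0.95),
    InputExample(texts=["myocardial infarction", "tension headache"], label=0.05),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Successive stages: each run trains with a finer (less coarse) set of
# truncation dims than the last, so the early runs carve out the coarse,
# low-dimensional structure that the later runs refine.
stages = [
    [64],                 # rough form: only the coarsest prefix
    [64, 128, 256],       # intermediate detail
    [64, 128, 256, 384],  # full Matryoshka ladder
]

for dims in stages:
    base_loss = losses.CoSENTLoss(model)  # any pairwise similarity loss works
    stage_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=dims)
    model.fit(train_objectives=[(loader, stage_loss)], epochs=1, warmup_steps=0)
```

`MatryoshkaLoss` simply wraps whatever base loss is already in use, so the staging only changes which truncation dims each run optimizes.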

Thank you.