Can we directly use the embeddings from masked language models?

Hello there,

I have a short conceptual question. I know I can train a masked language model from scratch. By doing so with Hugging Face, I should be able to obtain a model that is very good at … filling the [mask] token!

But what about the embeddings? Are they any good for clustering, for instance? Note that I am NOT fine-tuning the MLM model in any way. I am only interested in the embeddings that come from the MLM task itself.
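
To make the question concrete, here is a minimal sketch of what I mean by using the embeddings directly. It assumes that one sentence vector is obtained by averaging the last-layer token vectors while ignoring padding (mean pooling), which is only one possible choice. The helper names `mean_pool` and `embed` and the `model_dir` argument are hypothetical, and `embed` assumes a BERT-style checkpoint saved with its tokenizer.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token vectors per sentence, ignoring padding positions.

    hidden: (batch, seq_len, dim) token vectors
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)[..., None]  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


def embed(texts, model_dir):
    """Hypothetical helper: sentence vectors from a saved MLM checkpoint."""
    # Imported lazily so mean_pool can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # AutoModel loads only the encoder, without the MLM prediction head.
    model = AutoModel.from_pretrained(model_dir)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # No fine-tuning involved: these are the raw last-layer hidden states.
    return mean_pool(out.last_hidden_state.numpy(), batch["attention_mask"].numpy())
```

The array returned by `embed` has one row per input text and could then be handed to any clustering routine. Whether those rows cluster in a meaningful way is exactly what I am unsure about.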

Any suggestions or papers would be greatly appreciated.