MiniLMv2-L6-H384-distilled-from-RoBERTa-Large for continual pre-training?

Does anyone know if the MiniLMv2-L6-H384-distilled-from-xxx models are suitable for continual pre-training?
I see they are tagged with Fill-Mask, but their mask predictions, both in the inference widget on the model card page and when run locally, seem to return gibberish.
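
For reference, this is roughly how I'm checking the mask predictions locally (the hub ID below is the nreimers checkpoint; adjust if yours differs):

```python
from transformers import pipeline

# Hub ID assumed here; swap in whichever MiniLMv2 checkpoint you're testing
fill = pipeline(
    "fill-mask",
    model="nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large",
)

# Use the tokenizer's own mask token (<mask> for RoBERTa-style tokenizers)
print(fill(f"The capital of France is {fill.tokenizer.mask_token}."))
```

The top predictions I get back look like random subwords rather than anything sensible.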

I really like these models because they are so small and fast (and perform really well), but I'm wondering if I'd be better off switching to distilbert or something else if pre-training with in-domain vocabulary is something I want to explore.
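
For context, the kind of continual pre-training I have in mind is a standard MLM setup along these lines (just a rough sketch; the corpus file name and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# "my_domain_corpus.txt" stands in for whatever in-domain text I'd train on
dataset = load_dataset("text", data_files={"train": "my_domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking with the usual 15% mask probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="minilm-domain-mlm",
        num_train_epochs=1,
        per_device_train_batch_size=32,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```

So the question is really whether the MLM head in these checkpoints is in a usable state to start from, or whether I'd effectively be training it from scratch.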