Training a model using a predefined vocab

Hi everyone.

I would like to train a model from scratch using a predefined list of tokens as its vocabulary. This way the vocab and the tokenizer wouldn’t be created during language modelling, and training would only need to learn the embeddings for the predefined tokens. I am not sure if this is possible, since I suppose it would require some kind of custom tokenizer.
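
To make the idea concrete, here is a rough sketch of what I have in mind. It is plain Python, not any particular library's API. The token list, the `FixedVocabTokenizer` class, and the `init_embeddings` helper are all hypothetical names I made up for illustration, and I am assuming the input can simply be split on whitespace.

```python
import random

# Hypothetical predefined vocabulary; the real list would come from my own data.
PREDEFINED_TOKENS = ["[PAD]", "[UNK]", "hello", "world", "model"]


class FixedVocabTokenizer:
    """Maps whitespace-separated tokens to ids using a fixed vocabulary."""

    def __init__(self, tokens, unk_token="[UNK]"):
        if unk_token not in tokens:
            raise ValueError("unk_token must be part of the vocabulary")
        # The vocabulary is never learned or extended; it is used exactly as given.
        self.token_to_id = {tok: i for i, tok in enumerate(tokens)}
        self.id_to_token = list(tokens)
        self.unk_id = self.token_to_id[unk_token]

    def encode(self, text):
        # Assumption: whitespace splitting is enough for my tokens.
        return [self.token_to_id.get(tok, self.unk_id) for tok in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


def init_embeddings(vocab_size, dim, seed=0):
    """Randomly initialised embedding table, one row per predefined token.

    In a real model these rows would be the parameters learned during training.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


tokenizer = FixedVocabTokenizer(PREDEFINED_TOKENS)
embeddings = init_embeddings(len(PREDEFINED_TOKENS), dim=8)

ids = tokenizer.encode("hello unseen world")  # "unseen" falls back to [UNK]
vectors = [embeddings[i] for i in ids]        # embedding lookup by token id
```

So the tokenizer would be fixed up front, and the only thing left to learn would be the embedding rows (plus the rest of the model weights).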