Re-train microsoft/codebert-base tokenizer

Hello there,

This is my first post.
First of all, I want to commend the huggingface team and community for the amazing work they are doing. It is simply awesome.
To come quickly to the point, I want to expand the microsoft/codebert-base tokenizer to cover languages like C, C++, and C#.
I have done some research, and there seem to be two ways of doing it:
i) expand the tokenizer vocabulary
ii) train it from scratch
I see that there are instructions here (blog/ at master · huggingface/blog · GitHub) to do the same. However, they use a byte-level tokenizer, and I want to use a custom source code tokenizer (GitHub - dspinellis/tokenizer: Convert source code into numerical tokens).
Please help me with this.
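For reference, here is a minimal sketch of what option (i) might look like with the transformers API, assuming transformers is installed. The token strings below are made-up examples of what a C/C++ lexer might emit, not actual output of the dspinellis tokenizer:

```python
# Sketch of option (i): expanding an existing tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

# Hypothetical tokens from an external source-code tokenizer
# (e.g. dspinellis/tokenizer); replace with your real token stream.
new_tokens = ["size_t", "nullptr", "reinterpret_cast"]

# add_tokens returns how many of these were actually new to the vocabulary
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")

# When fine-tuning, the model's embedding matrix must be resized to match:
# from transformers import AutoModel
# model = AutoModel.from_pretrained("microsoft/codebert-base")
# model.resize_token_embeddings(len(tokenizer))
```

As I understand it, the newly added embeddings start out untrained, so some further pre-training or fine-tuning on C/C++/C# code would be needed for them to become useful.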

Hey, I have also been trying to figure out an implementation to expand CodeBERT to C. Have you come across any method to use a custom tokenizer yet?