Re-train microsoft/codebert-base tokenizer

Hey, I have also been trying to figure out how to extend CodeBERT to C. Have you found a way to use a custom tokenizer with it yet?