This is my first post.
First of all, I want to commend the Hugging Face team and community for the amazing work they are doing. It's simply awesome.
To come quickly to the point, I want to expand the microsoft/codebert-base tokenizer to cover languages like C, C++, and C#.
I have done some research, and there seem to be two ways of doing it:
i) expand the tokenizer vocabulary
ii) train it from scratch
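For approach (i), here is a minimal sketch of what I think the standard `transformers` workflow looks like: add the missing tokens to the tokenizer, then resize the model's embedding matrix to match. The helper name and the token list are just my own examples, not anything official:

```python
# Sketch of approach (i): extending an existing tokenizer's vocabulary.
# Assumes the Hugging Face `transformers` API; the helper name and the
# example tokens are made up for illustration.

def extend_vocab(tokenizer, model, new_tokens):
    """Add tokens missing from the tokenizer's vocabulary and resize
    the model's embedding matrix so the two stay in sync."""
    added = tokenizer.add_tokens(new_tokens)  # returns how many were new
    if added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return added

# Intended usage (requires `transformers` and a model download):
#
# from transformers import AutoTokenizer, AutoModel
# tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
# model = AutoModel.from_pretrained("microsoft/codebert-base")
# extend_vocab(tokenizer, model, ["std::vector", "nullptr", "->"])
```

The newly added embeddings start untrained, so some further fine-tuning on code would presumably be needed before they are useful.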
I see that there are instructions here (blog/how-to-train.md at master · huggingface/blog · GitHub) for training from scratch. However, that guide uses a byte-level tokenizer, and I want to use a custom source code tokenizer (GitHub - dspinellis/tokenizer: Convert source code into numerical tokens).
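One idea I had for combining approach (ii) with an external source code tokenizer: pre-tokenize the corpus with the external tool first (so each line is whitespace-separated tokens), then train a `tokenizers` model on that output with a plain whitespace pre-tokenizer. This is only a sketch under that assumption; the corpus lines below are a stand-in for real pre-tokenized output:

```python
# Sketch of approach (ii): training a tokenizer from scratch on a corpus
# that has already been pre-tokenized by an external tool (e.g. the
# dspinellis/tokenizer CLI), so we only need to split on whitespace.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Stand-in for real pre-tokenized source code lines.
corpus = [
    "int main ( ) { return 0 ; }",
    "std::vector < int > v ;",
]

tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

# RoBERTa-style special tokens, since codebert-base is RoBERTa-based.
trainer = trainers.BpeTrainer(
    special_tokens=["<s>", "<pad>", "</s>", "[UNK]", "<mask>"]
)
tok.train_from_iterator(corpus, trainer=trainer)
```

I am not sure whether a trained-from-scratch vocabulary can simply be swapped into codebert-base, though, since the pretrained embeddings are tied to the original vocabulary.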
Please help me figure out which route makes more sense here.