Re-train microsoft/codebert-base tokenizer

rnty · October 23, 2021, 5:59am

Hello there,

This is my first post.
First of all, I want to commend the Hugging Face team and community for the amazing work they are doing. It is simply awesome.
To get to the point quickly: I want to expand the microsoft/codebert-base tokenizer to cover languages like C, C++ and C#.
I have done some research, and there seem to be two ways of doing it:
i) expand the tokenizer vocabulary
ii) train it from scratch
I see that there are instructions here (blog/how-to-train.md at master · huggingface/blog · GitHub) for training from scratch. However, they use a byte-level tokenizer, and I want to use a custom source code tokenizer (GitHub - dspinellis/tokenizer: Convert source code into numerical tokens).
Please help me with this.
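
For reference, here is a minimal plain-Python sketch of what the two options amount to at the vocabulary level. It is only an illustration under assumptions: the function names `extend_vocab` and `build_vocab`, the toy vocabulary, and the sample token streams are all hypothetical, and the token streams stand in for the output of whatever custom source code tokenizer is used. In the transformers library, option (i) would typically go through `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the newly added embeddings start out randomly initialized, so further pre-training or fine-tuning would be needed in either case.

```python
from collections import Counter
from typing import Dict, Iterable, List


def extend_vocab(vocab: Dict[str, int], candidates: Iterable[str]) -> List[str]:
    """Option (i): append unseen tokens to an existing vocabulary.

    Assumes existing ids are contiguous (0 .. len(vocab) - 1), so each new
    token simply gets the next free id. Returns the tokens actually added.
    """
    added = []
    for tok in candidates:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            added.append(tok)
    return added


def build_vocab(
    token_streams: Iterable[List[str]],
    special_tokens: List[str],
    min_freq: int = 1,
) -> Dict[str, int]:
    """Option (ii): build a fresh vocabulary from pre-tokenized source files.

    Special tokens get the lowest ids; the remaining tokens are ordered by
    descending frequency and dropped if they occur fewer than min_freq times.
    """
    counts = Counter(tok for stream in token_streams for tok in stream)
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


# Hypothetical toy data: an "existing" vocabulary and two tokenized C snippets.
existing = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "return": 4}
c_streams = [
    ["int", "x", ";"],
    ["int", "y", ";", "return"],
]

added = extend_vocab(existing, ["int", "return", "int", "printf"])
fresh = build_vocab(c_streams, ["<s>", "<pad>", "</s>", "<unk>"], min_freq=2)
```

With option (i) the pretrained embeddings for the original tokens are kept, while option (ii) discards them, since the ids no longer line up with the pretrained embedding matrix.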

Amapocho · February 3, 2022, 5:49am

Hey, I have also been trying to figure out an implementation to expand CodeBERT to C. Have you come across any method to use a custom tokenizer yet?
