Hello there, this is my first post. First of all, I want to commend the Hugging Face team and community for the amazing work they are doing. It is simply awesome. To quickly come to the point, I want to expand the microsoft/codebert-base tokenizer to cover languages like C, C++ and C#. I have done so…

Re-train microsoft/codebert-base tokenizer

Amapocho February 3, 2022, 5:49am 2

Hey, I have also been trying to figure out an implementation to expand CodeBERT to C. Have you come across any method to use a custom tokenizer yet?
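One possible approach, sketched here under assumptions rather than taken from the thread: keep the existing CodeBERT vocabulary, add only the C/C++/C# tokens it lacks, and resize the model's embedding matrix to match. The helper names `select_new_tokens`, `extend_tokenizer` and `load_codebert` are hypothetical; the calls `get_vocab`, `add_tokens` and `resize_token_embeddings` are the standard transformers tokenizer and model methods.

```python
from typing import Iterable, List


def select_new_tokens(base_vocab: Iterable[str], candidates: Iterable[str]) -> List[str]:
    """Return candidates missing from base_vocab, de-duplicated, in input order."""
    known = set(base_vocab)
    picked: List[str] = []
    for tok in candidates:
        if tok and tok not in known:
            known.add(tok)
            picked.append(tok)
    return picked


def extend_tokenizer(tokenizer, model, candidates: Iterable[str]) -> int:
    """Add unseen tokens to the tokenizer and resize the model embeddings.

    Assumes `tokenizer` and `model` expose the usual transformers methods.
    Returns the number of tokens added.
    """
    new_tokens = select_new_tokens(tokenizer.get_vocab().keys(), candidates)
    if not new_tokens:
        return 0
    added = tokenizer.add_tokens(new_tokens)
    # The new embedding rows are freshly initialized, so the model needs
    # further training on the new languages before they become useful.
    model.resize_token_embeddings(len(tokenizer))
    return added


def load_codebert():
    """Hypothetical convenience loader; downloads the checkpoint from the Hub."""
    from transformers import AutoModel, AutoTokenizer

    name = "microsoft/codebert-base"
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
```

Usage would look like `tokenizer, model = load_codebert()` followed by `extend_tokenizer(tokenizer, model, ["nullptr", "constexpr"])`, where the candidate list is only an illustration. Training a completely new vocabulary instead (for example with `train_new_from_iterator` on a fast tokenizer) would discard the alignment with the pretrained embeddings, which is why this sketch extends the vocabulary rather than replacing it.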
