RoBERTa Tokenizer Java Implementation

RazivTri · April 12, 2022, 1:55pm

Hi everyone

I have a RoBERTa model working well in Python, and I want to move it to my service, which is written in Java.

For that I need to imitate the RobertaTokenizer Python class, since I couldn’t find a Java implementation of it. From what I understand (and I’m pretty new to Transformers), the RobertaTokenizer is similar to SentencePiece but not exactly the same.

As a reference, I have a Java tokenizer implementation for CamemBERT that uses SentencePiece, and the Hugging Face documentation says that the CamemBERT tokenizer inherits from the RoBERTa tokenizer.

My question is: what would be the best way to implement a RoBERTa tokenizer in Java? Can I use the SentencePiece class the way it is used for CamemBERT?

Thanks (;

RazivTri · November 29, 2022, 1:50pm

Hi all
We managed to implement a working RoBERTa tokenizer in Java.
We deployed it to the Maven Central Repository:

Hope this can help others
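
For readers wondering what such a tokenizer involves: RoBERTa's tokenizer is a byte-level BPE (the same scheme as GPT-2), driven by a vocab.json and a merges.txt file, rather than a SentencePiece model. The sketch below is not the library mentioned above. It is only a rough illustration of the core merge loop, under several assumptions: the class name `BpeSketch`, the method `bpe`, and the three-entry merge table are all made up for the example, the word is split by code point instead of going through the byte-to-unicode mapping, and pre-tokenization and the final vocab.json id lookup are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of any published library.
public class BpeSketch {

    // Applies ranked BPE merges to a single pre-tokenized word.
    // `ranks` maps "left right" (as written in a merges.txt line) to its
    // priority; a lower rank means the merge is applied earlier.
    static List<String> bpe(String word, Map<String, Integer> ranks) {
        // Simplification: start from code points. A real byte-level BPE
        // starts from UTF-8 bytes mapped to printable characters.
        List<String> symbols = new ArrayList<>();
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            symbols.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }

        while (symbols.size() > 1) {
            // Find the adjacent pair with the lowest rank.
            int bestRank = Integer.MAX_VALUE;
            int bestIdx = -1;
            for (int i = 0; i < symbols.size() - 1; i++) {
                Integer rank = ranks.get(symbols.get(i) + " " + symbols.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    bestIdx = i;
                }
            }
            if (bestIdx < 0) {
                break; // no known merge applies any more
            }

            // Merge every occurrence of that pair, scanning left to right.
            String left = symbols.get(bestIdx);
            String right = symbols.get(bestIdx + 1);
            List<String> merged = new ArrayList<>();
            int i = 0;
            while (i < symbols.size()) {
                boolean isPair = i < symbols.size() - 1
                        && symbols.get(i).equals(left)
                        && symbols.get(i + 1).equals(right);
                if (isPair) {
                    merged.add(left + right);
                    i += 2;
                } else {
                    merged.add(symbols.get(i));
                    i++;
                }
            }
            symbols = merged;
        }
        return symbols;
    }

    public static void main(String[] args) {
        // Toy merge table (made up for this example).
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("l o", 0);
        ranks.put("lo w", 1);
        ranks.put("e r", 2);
        System.out.println(bpe("lower", ranks));
    }
}
```

In a full tokenizer, the resulting sub-word strings would then be looked up in vocab.json to get ids, and the special tokens `<s>` and `</s>` would be added around the sequence.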
