RoBERTa Tokenizer Java Implementation

Hi everyone

I have a RoBERTa model working great in Python and I want to move it to my service - which is written in Java.

For that I need to imitate the RobertaTokenizer Python class - since I didn’t find a Java implementation for it. From what I understand, and I’m pretty new to Transformers, the RobertaTokenizer is similar to SentencePiece but not exactly like it.

As a reference, I have a Java tokenizer implementation for CamemBERT that uses SentencePiece, and the Hugging Face documentation says that the CamemBERT tokenizer inherits from the RoBERTa tokenizer.

My question is: what would be the best way to implement a RoBERTa tokenizer in Java? Can I use the SentencePiece class the way the CamemBERT implementation does?

Thanks (;


Hi all
We managed to implement a working RoBERTa tokenizer in Java.
We deployed it to the Maven Central Repository:

Hope this can help others :blush:
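
For anyone weighing the SentencePiece question above: the Hugging Face RobertaTokenizer is a byte-level BPE tokenizer (the same scheme as GPT-2) driven by a vocab.json and a merges.txt file, so it does not load a SentencePiece model. Below is a minimal sketch of just the merge loop at the heart of BPE, in Java. It is not the code of the published library; the class name `BpeSketch`, the helper names, and the three toy merge rules are all made up for illustration. A real tokenizer would also need the regex pre-tokenization, the byte-to-unicode mapping, and the vocabulary lookup, none of which are shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the BPE merge loop; not taken from any library.
public class BpeSketch {

    // Builds a rank table from merge rules given in priority order.
    // Each rule is "left right", the same shape as a line in merges.txt.
    public static Map<String, Integer> ranksFrom(List<String> merges) {
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < merges.size(); i++) {
            ranks.put(merges.get(i), i);
        }
        return ranks;
    }

    // Repeatedly merges the adjacent pair with the lowest rank
    // until no adjacent pair appears in the rank table.
    // Assumes symbols contain no literal space (byte-level BPE remaps spaces).
    public static List<String> bpe(List<String> symbols, Map<String, Integer> ranks) {
        List<String> word = new ArrayList<>(symbols);
        while (word.size() > 1) {
            int bestRank = Integer.MAX_VALUE;
            String bestLeft = null;
            String bestRight = null;
            // Find the highest-priority (lowest-rank) adjacent pair.
            for (int i = 0; i < word.size() - 1; i++) {
                Integer rank = ranks.get(word.get(i) + " " + word.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    bestLeft = word.get(i);
                    bestRight = word.get(i + 1);
                }
            }
            if (bestLeft == null) {
                break; // no known pair left to merge
            }
            // Merge every occurrence of that pair, scanning left to right.
            List<String> merged = new ArrayList<>();
            int i = 0;
            while (i < word.size()) {
                boolean isPair = i < word.size() - 1
                        && word.get(i).equals(bestLeft)
                        && word.get(i + 1).equals(bestRight);
                if (isPair) {
                    merged.add(bestLeft + bestRight);
                    i += 2;
                } else {
                    merged.add(word.get(i));
                    i++;
                }
            }
            word = merged;
        }
        return word;
    }

    public static void main(String[] args) {
        // Toy merge rules, invented for this example.
        Map<String, Integer> ranks = ranksFrom(List.of("l o", "lo w", "e r"));
        System.out.println(bpe(List.of("l", "o", "w", "e", "r"), ranks)); // prints [low, er]
    }
}
```

With the toy rules, "l o" merges first, then "lo w", then "e r", and the loop stops because "low er" is not a known pair. In a real setup the rank table would come from the model's merges.txt, and each resulting piece would be looked up in vocab.json to get its token id.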
