Build a RoBERTa tokenizer from scratch

Pinging @Narsil