Roberta pretokenizer - split punctuation?

When using the ByteLevelBPETokenizer to build a tokenizer for a new Roberta model, I found that the resulting vocabulary contains quite a few tokens that are just a letter with a period or other punctuation attached. I took a look at the ByteLevelBPETokenizer implementation:

It appears the pretokenizer used is always

tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=add_prefix_space)

The character-level one (CharBPETokenizer) has an option to pretokenize on more than just whitespace:

if split_on_whitespace_only:
    tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
else:
    tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Is there a recommended solution in the byte-level version that will pretokenize the punctuation separately from the rest of the text? Is that something I should do myself before building the Roberta model (and possibly before creating the transformer itself)?

You can write your own pre-tokenizer, or use an existing one that splits text on punctuation. For example, BertPreTokenizer breaks text into word-like pieces and separates punctuation; running it ahead of the byte-level step means the BPE trainer never sees a letter glued to a period or comma. This gives you more precise control over the tokenization process and avoids the extra tokens you noticed that are a letter with punctuation attached. Keep in mind that the approach may require some additional coding and testing to make sure it behaves the way you need.
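Concretely, one way to do this (just a sketch on my part, not a built-in option of ByteLevelBPETokenizer; the file path, vocab size and special tokens below are placeholders) is to skip the wrapper class and build the tokenizer directly, chaining a punctuation splitter with the byte-level pre-tokenizer via pre_tokenizers.Sequence:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE model, same as the ByteLevelBPETokenizer wrapper uses
tokenizer = Tokenizer(models.BPE())

# Split punctuation into its own pieces first, then apply the usual
# byte-level mapping. Punctuation() could also be BertPreTokenizer().
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Punctuation(),
    pre_tokenizers.ByteLevel(add_prefix_space=True),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # placeholder
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("tokenizer.json")

Because the punctuation split happens inside the tokenizer itself, the same behaviour applies automatically when you encode text later, so no separate pre-processing step is needed.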

Thanks. That seems like a good approach - so basically I just need to process the text before passing it to the tokenizer builder? It would be a little more convenient for it to be a built-in option, as with the Bert training, but it isn’t too hard to work around, at least.
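For example, something along these lines is what I had in mind (again just a sketch; the regex, file name and training settings are my own guesses):

import re
from tokenizers import ByteLevelBPETokenizer

def split_punctuation(lines):
    # Put spaces around punctuation so the trainer never sees a letter
    # glued to a period, comma, etc.
    punct = re.compile(r"([^\w\s])")
    for line in lines:
        yield punct.sub(r" \1 ", line)

tokenizer = ByteLevelBPETokenizer()
with open("corpus.txt", encoding="utf-8") as f:  # placeholder path
    tokenizer.train_from_iterator(
        split_punctuation(f),
        vocab_size=30_000,  # placeholder
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
tokenizer.save_model("tokenizer_out")  # placeholder directory

The catch is that the same splitting would have to be applied to any text tokenized later, which is why baking it into the pre-tokenizer as you showed is probably the tidier option.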