Roberta pretokenizer - split punctuation?

When using the ByteLevelBPETokenizer to build a tokenizer for a new Roberta model, I found that the resulting vocabulary contains quite a few tokens that are just a letter with a period or other punctuation attached. I took a look at the ByteLevelBPETokenizer implementation:

It appears the pretokenizer used is always

tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=add_prefix_space)

The character-level one (CharBPETokenizer) has an option to pretokenize on more than just whitespace:

if split_on_whitespace_only:
    tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
else:
    tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Is there a recommended solution in the byte-level version that will pretokenize the punctuation separately from the rest of the text? Is that something I should do myself before building the Roberta model (and possibly before creating the transformer itself)?

You can write your own pre-tokenizer, or use an existing one that splits text on punctuation. For example, BertPreTokenizer breaks text into word-like pieces and separates punctuation; running it ahead of the byte-level step means the BPE trainer never sees a letter glued to a period or comma. This gives you more precise control over the tokenization process and avoids the extra tokens you noticed that are a letter with punctuation attached. Keep in mind that the approach may require some additional coding and testing to make sure it behaves the way you need.
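Concretely, one way to do this (just a sketch on my part, not a built-in option of ByteLevelBPETokenizer; the file path, vocab size and special tokens below are placeholders) is to skip the wrapper class and build the tokenizer directly, chaining a punctuation splitter with the byte-level pre-tokenizer via pre_tokenizers.Sequence:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE model, same as the ByteLevelBPETokenizer wrapper uses
tokenizer = Tokenizer(models.BPE())

# Split punctuation into its own pieces first, then apply the usual
# byte-level mapping. Punctuation() could also be BertPreTokenizer().
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Punctuation(),
    pre_tokenizers.ByteLevel(add_prefix_space=True),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # placeholder
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("tokenizer.json")

Because the punctuation split happens inside the tokenizer itself, the same behaviour applies automatically when you encode text later, so no separate pre-processing step is needed.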

Thanks. That seems like a good approach - so basically I just need to process the text before passing it to the tokenizer builder? It would be a little more convenient for it to be a built-in option, as with the Bert training, but it isn’t too hard to work around, at least.
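For example, something along these lines is what I had in mind (again just a sketch; the regex, file name and training settings are my own guesses):

import re
from tokenizers import ByteLevelBPETokenizer

def split_punctuation(lines):
    # Put spaces around punctuation so the trainer never sees a letter
    # glued to a period, comma, etc.
    punct = re.compile(r"([^\w\s])")
    for line in lines:
        yield punct.sub(r" \1 ", line)

tokenizer = ByteLevelBPETokenizer()
with open("corpus.txt", encoding="utf-8") as f:  # placeholder path
    tokenizer.train_from_iterator(
        split_punctuation(f),
        vocab_size=30_000,  # placeholder
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
tokenizer.save_model("tokenizer_out")  # placeholder directory

The catch is that the same splitting would have to be applied to any text tokenized later, which is why baking it into the pre-tokenizer as you showed is probably the tidier option.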